GoHighLevel markets itself as an all-in-one CRM, and one of the features it includes is a course hosting module. You upload your lessons, add some images, attach quizzes, and your content lives inside their walled garden. The problem comes when you want to leave. There is no export button for course content. If you want to move your materials to a static site or a custom setup, you have to go in and get it yourself.
I recently built a scraper to pull course data out of GoHighLevel for exactly this reason. Two courses needed to migrate off the platform to a completely custom setup. The courses contained standard lesson content with text and images, plus interactive quizzes. The extraction required figuring out the platform structure, handling authentication, downloading media, and patching references. Here is how it worked.
The first challenge was simply discovering all the content. GoHighLevel does not give you a sitemap or an API endpoint that lists every lesson in a course. You get a course ID, and from there you have to traverse the internal API to find modules and then the individual lessons inside those modules. The scraper had to authenticate with the platform, grab the course structure, and recursively walk through every module to collect the lesson URLs. Each lesson required a separate API call to fetch the actual content body.
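A minimal sketch of that traversal, assuming an already-authenticated requests session; the endpoint paths and response field names here are placeholders, since the real internal routes have to be read out of the browser's network traffic:

```python
import requests

API_BASE = "https://services.example-ghl-api.com"  # placeholder; use the real internal API host

def collect_lesson_ids(session: requests.Session, course_id: str) -> list[str]:
    """Walk the course outline: course -> modules -> lessons."""
    # Endpoint paths and field names are assumptions, not documented routes.
    outline = session.get(f"{API_BASE}/courses/{course_id}").json()
    lesson_ids = []
    for module in outline.get("modules", []):
        detail = session.get(f"{API_BASE}/modules/{module['id']}").json()
        lesson_ids.extend(lesson["id"] for lesson in detail.get("lessons", []))
    return lesson_ids
```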
Once I had the lesson content, the next problem was images. GoHighLevel stores uploaded images on Google Cloud Storage. That part is standard. The unusual detail is the file naming convention. Every image gets a UUID as its filename with no file extension. You end up with URLs pointing to something like https://storage.googleapis.com/bucket/550e8400-e29b-41d4-a716-446655440000 with no indication of whether that is a PNG, JPEG, GIF, or WebP.
The scraper had to download each image and determine the actual MIME type from the response headers. Using the Content-Type header from the HTTP response, I could map the binary data back to the correct extension and save the file locally under its UUID name plus that extension. Inside the lesson HTML, every src attribute pointing to a GCS URL then needed rewriting to the new local file path. The substitution itself was simple, but it had to run against parsed HTML rather than blind string replacement, so that only image tags were touched and no other attributes broke.
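The type-to-extension step is the only mildly fiddly part. A minimal sketch, covering the formats I ran into, with mimetypes as a fallback for anything rarer:

```python
import mimetypes

# Common image Content-Type values mapped to extensions; anything else
# falls through to mimetypes.guess_extension.
EXTENSION_MAP = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "image/webp": ".webp",
}

def extension_for(content_type: str) -> str:
    # Content-Type may carry parameters, e.g. "image/png; charset=binary".
    mime = content_type.split(";")[0].strip().lower()
    return EXTENSION_MAP.get(mime) or mimetypes.guess_extension(mime) or ".bin"
```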
The two courses had different content profiles. One course was image-heavy with 19 lessons containing 109 images total. The other course was shorter on text and images, just 4 lessons with 6 images, but it included 43 quiz files. That difference in quiz volume shaped how the extraction logic had to work.
Quizzes in GoHighLevel are not standard HTML forms. They are interactive JavaScript widgets. When you look at the lesson source, the quiz is represented by a script tag or a custom element that renders the interactive experience at runtime. There is no static HTML fallback and no JSON payload sitting in the DOM that you can just scrape and reuse. The quiz data exists somewhere in their backend, but the lesson content endpoint does not expose it in a usable format.
Attempting to parse the JavaScript widgets would have been fragile and ultimately pointless. The interactive functionality depends on GoHighLevel’s frontend code, so even if I extracted the widget configuration, it would not work outside their environment. The goal was migration to a static setup, so the quizzes needed to be rebuilt anyway.
The solution was to handle quiz files gracefully during extraction. When the scraper detected a quiz widget in the lesson content, it stripped out the JavaScript and replaced it with a placeholder comment. The placeholder noted that an interactive quiz existed at that position in the original content. This way, the lesson structure stayed intact, the text content remained in the correct order, and whoever rebuilt the quizzes in the new system knew exactly where each one belonged.
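A sketch of that substitution with BeautifulSoup; the selector is an assumption, since the exact tag and attribute names for the quiz widget have to be taken from the real lesson markup:

```python
from bs4 import BeautifulSoup, Comment

def strip_quiz_widgets(html: str) -> str:
    """Replace quiz widgets with placeholder comments in the same position."""
    soup = BeautifulSoup(html, "html.parser")
    # "script[data-quiz-id]" is illustrative; inspect the lesson source to
    # see how quizzes are actually embedded.
    for n, widget in enumerate(soup.select("script[data-quiz-id]"), start=1):
        quiz_id = widget.get("data-quiz-id", f"quiz-{n}")
        widget.replace_with(Comment(f" GHL-QUIZ-PLACEHOLDER id={quiz_id} "))
    return str(soup)
```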
The extraction script itself was written in Python, using requests for HTTP calls and BeautifulSoup for HTML parsing. Authentication meant borrowing credentials from a logged-in browser session: GoHighLevel uses token-based auth, so the scraper grabbed the session token produced by the browser login and attached it to every subsequent API request.
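Setting that up looks something like this; treat the Authorization header name as an assumption, and copy whatever the browser's network tab actually shows GoHighLevel sending:

```python
import requests

def make_session(token: str) -> requests.Session:
    """Build a session that carries the auth token on every request."""
    session = requests.Session()
    session.headers.update({
        # Header name is an assumption; mirror what the browser sends.
        "Authorization": f"Bearer {token}",
        "User-Agent": "Mozilla/5.0",  # match the browser that produced the token
    })
    return session
```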
The main loop worked in stages. First, fetch the course outline and build a list of module IDs. Second, iterate through each module to get lesson IDs. Third, request each lesson’s content body. Fourth, parse the HTML, extract image URLs, download the images, and rewrite the src attributes. Fifth, detect quiz widgets and insert the placeholder comments. Finally, save the cleaned HTML to local files organized by module and lesson order.
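Stitched together, the orchestration stays short. This sketch composes the helpers from the surrounding snippets and flattens lessons into a single list for brevity (the real script grouped output by module); fetch_lesson_html and localize_images are hypothetical names for the lesson-fetch and image-rewrite stages:

```python
from pathlib import Path

def extract_course(session, course_id: str, out_dir: Path) -> None:
    """Run the pipeline stages in order for one course."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for position, lesson_id in enumerate(collect_lesson_ids(session, course_id), 1):
        html = fetch_lesson_html(session, lesson_id)                # hypothetical helper
        html = localize_images(session, html, out_dir / "assets")   # hypothetical helper
        html = strip_quiz_widgets(html)
        (out_dir / f"{position:02d}-lesson.html").write_text(html, encoding="utf-8")
```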
Downloading 109 images from GCS was not slow, but it still deserved some restraint. I used requests.Session to keep connections alive across the sequential requests and added a small delay between downloads so the scraper was not hammering the server. Every downloaded image was verified before writing to disk by checking that the response status was 200 and that the body was non-empty, i.e. a Content-Length greater than zero.
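A sketch of the download helper under those rules, reusing extension_for from earlier; the 0.2-second delay and 30-second timeout are arbitrary choices, not documented limits:

```python
import time
from pathlib import Path

import requests

def download_image(session: requests.Session, url: str, dest_dir: Path) -> Path | None:
    resp = session.get(url, timeout=30)
    # Verify before writing: 200 status and a non-empty body.
    if resp.status_code != 200 or len(resp.content) == 0:
        return None
    ext = extension_for(resp.headers.get("Content-Type", ""))
    dest_dir.mkdir(parents=True, exist_ok=True)
    path = dest_dir / (url.rsplit("/", 1)[-1] + ext)  # UUID filename plus real extension
    path.write_bytes(resp.content)
    time.sleep(0.2)  # small, arbitrary delay between downloads
    return path
```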
One issue that came up during image processing was duplicate images. The same image sometimes appeared in multiple lessons. Rather than downloading duplicates, the scraper maintained a set of already-seen UUIDs. If an image UUID had been processed in a previous lesson, the scraper would skip the download and just update the HTML reference to point to the existing local file. This cut down on redundant network requests and saved disk space.
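The cache is nothing more than a dictionary keyed by UUID, wrapped around the download helper above:

```python
from pathlib import Path

# UUID -> local path, shared across all lessons in the run.
seen_images: dict[str, Path] = {}

def get_or_download(session, url: str, assets_dir: Path) -> Path | None:
    uuid = url.rsplit("/", 1)[-1]
    if uuid in seen_images:
        return seen_images[uuid]  # already fetched in an earlier lesson
    path = download_image(session, url, assets_dir)
    if path is not None:
        seen_images[uuid] = path
    return path
```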
The HTML rewriting step needed to handle both absolute and relative URL patterns. Most image references were absolute URLs pointing directly to the GCS bucket, but a few were relative paths that resolved against the GoHighLevel domain. The scraper normalized all paths to absolute URLs first, checked them against the GCS pattern, and then performed the local file substitution. Any URL that did not match the known GCS bucket was left alone to avoid breaking external links.
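With urllib.parse.urljoin the normalization is short; the base URL and bucket prefix below are stand-ins for the real values:

```python
from urllib.parse import urljoin

GHL_BASE = "https://app.gohighlevel.com"        # assumed base for relative paths
GCS_PREFIX = "https://storage.googleapis.com/"  # narrow this to your actual bucket

def normalize_gcs_url(src: str) -> str | None:
    """Return an absolute GCS URL, or None if the src points elsewhere."""
    absolute = urljoin(GHL_BASE, src)  # absolute inputs pass through unchanged
    return absolute if absolute.startswith(GCS_PREFIX) else None
```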
The final output was a directory structure mirroring the course layout. Each module became a folder, each lesson became an HTML file inside that folder, and images went into a shared assets directory. The placeholder comments for quizzes followed a consistent format so they could be searched and replaced programmatically later.
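For illustration, the output looks something like this; the folder names here are placeholders, with the real ones derived from module titles and lesson order:

```
course/
├── 01-module/
│   ├── 01-lesson.html
│   └── 02-lesson.html
├── 02-module/
│   └── 01-lesson.html
└── assets/
    └── 550e8400-e29b-41d4-a716-446655440000.png
```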
Migrating off a closed platform always involves this kind of reverse engineering work. GoHighLevel does not make course extraction difficult on purpose. They simply did not build an export feature because they assume you will stay. When you need to leave, you have to read the network traffic, understand their API patterns, and build your own extraction pipeline. The technical details are not complicated, but you have to account for edge cases like extensionless files and JavaScript widgets that refuse to sit still for a scraper. Handle those gracefully, and you get your content back in a clean format that works anywhere.