Extracting PDF Text for LLMs

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption. It is designed for developers and data scientists working with PDFs.

Text Extraction

Overview

About This Skill

This skill helps developers and data scientists extract readable text from PDF documents in a form that language models can consume. It lives in the tools/pdf category of the Letta skills repository and combines practical guidance with tools for PDF processing workflows. Its extraction methods aim to preserve document structure while removing formatting noise, which addresses a common problem when preparing PDF content for AI agents and LLMs. The skill plugs into Letta Code and Claude Code through the skills system, so agents can load PDF processing capabilities on demand.

How to Use

Install the skill with npx (bundled with npm) by running `npx skills add https://github.com/letta-ai/skills --skill extracting-pdf-text`.

Use Cases

Extract simple text from PDF documents using PyMuPDF.

Retrieve tabular data from PDFs with pdfplumber.

Process scanned or image-based PDFs using OCR.

Implement a complete RAG pipeline using marker-pdf.
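The first three use cases can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the skill's own implementation: the function names (`extract_text_pymupdf`, `extract_tables_pdfplumber`, `looks_scanned`, `table_to_markdown`) and the 20-character threshold are illustrative, and PyMuPDF and pdfplumber must be installed separately (for example with `pip install pymupdf pdfplumber`). The library imports sit inside the functions so that the pure helpers work without either package.

```python
from typing import List, Optional, Sequence


def extract_text_pymupdf(path: str) -> List[str]:
    """Return the plain text of each page, in page order, using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers below work without it

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


def extract_tables_pdfplumber(path: str) -> List[List[List[Optional[str]]]]:
    """Return every table pdfplumber detects, each as rows of cell strings."""
    import pdfplumber

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def looks_scanned(page_texts: Sequence[str], min_chars: int = 20) -> bool:
    """Heuristic (an assumption, not a rule from the skill): if more than half
    of the pages have almost no text layer, the PDF probably needs OCR."""
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return sparse > len(page_texts) / 2


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render one extracted table as Markdown, treating the first row as the header."""
    if not rows:
        return ""

    def cell(value: Optional[str]) -> str:
        # Empty cells come back as None; newlines and pipes would break the row.
        return (value or "").replace("\n", " ").replace("|", "\\|").strip()

    def line(row: Sequence[Optional[str]]) -> str:
        return "| " + " | ".join(cell(c) for c in row) + " |"

    lines = [line(rows[0]), "| " + " | ".join("---" for _ in rows[0]) + " |"]
    lines.extend(line(row) for row in rows[1:])
    return "\n".join(lines)
```

A caller could run `extract_text_pymupdf` first and hand the file to an OCR tool only when `looks_scanned` returns true. The OCR step itself is left out here because the listing does not name an OCR engine.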

Setup & Installation

Quick Install

Terminal

$ npx skills add https://github.com/letta-ai/skills --skill extracting-pdf-text

Alternative Install (Git Clone)

git clone https://github.com/letta-ai/skills

Requirements

Claude Code or compatible AI agent

Quick Start Guide

Install the Skill

Copy the install command above and run it in your terminal.

Open Your AI Agent

Launch Claude Code, Cursor, or your preferred AI coding agent.

Try It Out

Use the prompt template or examples below to test the skill.

Customize

Adapt the skill to your specific use case and workflow.

Usage Examples

Prompt Template

Extract all readable text from the PDF at [PDF_FILE_PATH] and format it as plain text with clear section headers. Preserve the reading order and ignore non-text elements like images or tables. If the PDF is scanned or image-based, first convert it to text using OCR. Return the raw text in a single block without commentary.
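The `[PDF_FILE_PATH]` placeholder can be filled in programmatically before the prompt is handed to an agent. A minimal sketch, assuming a hypothetical helper named `build_prompt` and a `.pdf` extension check that are illustrative only and not part of the skill:

```python
from pathlib import Path

# The prompt template from above, copied verbatim.
PROMPT_TEMPLATE = (
    "Extract all readable text from the PDF at [PDF_FILE_PATH] and format it as "
    "plain text with clear section headers. Preserve the reading order and ignore "
    "non-text elements like images or tables. If the PDF is scanned or image-based, "
    "first convert it to text using OCR. Return the raw text in a single block "
    "without commentary."
)

PLACEHOLDER = "[PDF_FILE_PATH]"


def build_prompt(pdf_path: str, template: str = PROMPT_TEMPLATE) -> str:
    """Substitute a concrete PDF path into the template (hypothetical helper)."""
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain the [PDF_FILE_PATH] placeholder")
    path = Path(pdf_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got: {pdf_path}")
    return template.replace(PLACEHOLDER, str(path))


if __name__ == "__main__":
    print(build_prompt("annual_report.pdf"))
```

Rejecting paths without a `.pdf` extension is a cheap guard against passing the wrong file. It does not verify that the file exists or that it is a valid PDF.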

Example Output

```
# Annual Financial Report 2023
## Executive Summary
Revenue for [COMPANY] grew by 12% in 2023, reaching $45.2M, driven by expansion in the [INDUSTRY] sector. Key highlights include the launch of Product X in Q3 and a 5% increase in customer retention.

## Financial Highlights
- **Total Revenue:** $45.2M (+12% YoY)
- **Net Profit:** $8.7M (+18% YoY)
- **Operating Expenses:** $22.1M (within budget)

## Market Analysis
The [INDUSTRY] market saw a 3% decline due to regulatory changes, but [COMPANY] mitigated impact through strategic partnerships.
```
