1. Document Conversion
Convert Office documents and PDFs to Markdown while preserving structure such as headings, lists, and tables.
Supported formats:
- PDF files (with optional Azure Document Intelligence integration)
- Word documents (DOCX)
- PowerPoint presentations (PPTX)
- Excel spreadsheets (XLSX, XLS)
Basic usage:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```
Command-line:
```bash
markitdown document.pdf -o output.md
```
See references/document_conversion.md for detailed documentation on document-specific features.
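To get the same result as the command-line -o flag from Python, the converted text can be written to a file. The following is a minimal sketch, not part of the library: save_markdown is a hypothetical helper name, and it assumes only that the converter object exposes convert() and that the result has a text_content attribute, as in the basic usage example above.

```python
from pathlib import Path


def save_markdown(converter, source, output_path):
    """Convert `source` and write the resulting Markdown to `output_path`.

    `converter` is any object with a convert() method whose result has a
    text_content attribute, for example a MarkItDown instance.
    (save_markdown is a hypothetical helper, not a markitdown function.)
    """
    result = converter.convert(str(source))
    output_path = Path(output_path)
    # Create the destination folder if it does not exist yet.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(result.text_content, encoding="utf-8")
    return output_path
```

With the library installed, a call such as save_markdown(MarkItDown(), "document.pdf", "output.md") should produce the same kind of file as the command-line example.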
2. Media Processing
Extract text from images using OCR and transcribe audio files to text.
Supported formats:
- Images (JPEG, PNG, GIF, etc.) with EXIF metadata extraction
- Audio files with speech transcription (requires the SpeechRecognition package, imported as speech_recognition)
Image with OCR:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("image.jpg")
print(result.text_content) # Includes EXIF metadata and OCR text
```
Audio transcription:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("audio.wav")
print(result.text_content)  # Transcribed speech
```
See references/media_processing.md for advanced media handling options.
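Because audio transcription depends on an optional module, it can help to check for it before attempting a conversion. This is a small sketch using only the standard library; module_available and can_transcribe_audio are hypothetical helper names introduced here for illustration.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


def can_transcribe_audio():
    # speech_recognition is the module named above as the audio dependency.
    return module_available("speech_recognition")
```

A caller might skip audio files, or print an installation hint, when can_transcribe_audio() returns False.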
3. Web Content Extraction
Convert web-based content and e-books to Markdown.
Supported formats:
- HTML files and web pages
- YouTube video transcripts (via URL)
- EPUB books
- RSS feeds
YouTube transcript:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```
See references/web_content.md for web extraction details.
4. Structured Data Handling
Convert structured data formats to readable Markdown tables.
Supported formats:
- CSV files
- JSON files
- XML files
CSV to Markdown table:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content) # Formatted as Markdown table
```
See references/structured_data.md for format-specific options.
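The four format groups above each point to their own reference file. As an illustrative sketch, the lookup below maps a file extension to the matching reference document. The extension table is an assumption built from the formats and example filenames listed above, reference_for is a hypothetical helper, and none of this reflects how MarkItDown itself detects formats.

```python
from pathlib import Path

# Extension-to-reference mapping assembled from the sections above
# (illustrative only; formats without a distinctive extension are omitted).
REFERENCE_BY_EXTENSION = {
    ".pdf": "references/document_conversion.md",
    ".docx": "references/document_conversion.md",
    ".pptx": "references/document_conversion.md",
    ".xlsx": "references/document_conversion.md",
    ".xls": "references/document_conversion.md",
    ".jpg": "references/media_processing.md",
    ".jpeg": "references/media_processing.md",
    ".png": "references/media_processing.md",
    ".gif": "references/media_processing.md",
    ".wav": "references/media_processing.md",
    ".html": "references/web_content.md",
    ".epub": "references/web_content.md",
    ".csv": "references/structured_data.md",
    ".json": "references/structured_data.md",
    ".xml": "references/structured_data.md",
}


def reference_for(path):
    """Return the reference file for `path`, or None if the extension is unknown."""
    return REFERENCE_BY_EXTENSION.get(Path(path).suffix.lower())
```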
5. Advanced Integrations
Enhance conversion quality with AI-powered features.
Azure Document Intelligence:
For enhanced PDF processing with better table extraction and layout analysis:
```python
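from markitdown import MarkItDown
# The empty strings are placeholders: supply your own Azure Document Intelligence endpoint and key.
md = MarkItDown(docintel_endpoint="", docintel_key="")
result = md.convert("complex.pdf")
print(result.text_content)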
```
LLM-Powered Image Descriptions:
Generate detailed image descriptions using GPT-4o:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx") # Images described with LLM
```
See references/advanced_integrations.md for integration details.
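Credentials are usually better kept out of source code. One possible approach, sketched below, is to read them from environment variables and pass them on only when both are present. The variable names DOCINTEL_ENDPOINT and DOCINTEL_KEY and the helper docintel_kwargs are assumptions made for this sketch; the keyword names are copied from the Azure example above.

```python
import os


def docintel_kwargs(env=None):
    """Build keyword arguments for MarkItDown(...) from environment variables.

    DOCINTEL_ENDPOINT and DOCINTEL_KEY are hypothetical variable names
    chosen for this sketch; adjust them to your own setup.
    """
    env = os.environ if env is None else env
    endpoint = env.get("DOCINTEL_ENDPOINT", "").strip()
    key = env.get("DOCINTEL_KEY", "").strip()
    if endpoint and key:
        # Keyword names follow the Azure Document Intelligence example above.
        return {"docintel_endpoint": endpoint, "docintel_key": key}
    # If either value is missing, return nothing so the default converter is used.
    return {}
```

The result can then be unpacked into the constructor, for example MarkItDown(**docintel_kwargs()), which falls back to the default converter when the variables are unset.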
6. Batch Processing
Process multiple files or entire ZIP archives at once.
ZIP file processing:
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("archive.zip")
print(result.text_content) # All files converted and concatenated
```
Batch script:
Use the provided batch processing script for directory conversion:
```bash
python scripts/batch_convert.py /path/to/documents /path/to/output
```
See scripts/batch_convert.py for implementation details.
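For orientation, a directory conversion loop might look roughly like the sketch below. This is not the contents of scripts/batch_convert.py, which is not reproduced here; batch_convert, its parameters, and the default extension list are assumptions for illustration. The converter is any object with a convert() method returning a result with text_content, such as a MarkItDown instance.

```python
from pathlib import Path


def batch_convert(converter, input_dir, output_dir,
                  extensions=(".pdf", ".docx", ".pptx", ".xlsx")):
    """Convert matching files under input_dir and mirror them into output_dir.

    Returns (converted, failed): a list of written paths and a list of
    (source_path, error_message) pairs. Hypothetical helper for illustration.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    converted, failed = [], []
    for source in sorted(input_dir.rglob("*")):
        if not source.is_file() or source.suffix.lower() not in extensions:
            continue
        # Append ".md" to the full name so a.pdf and a.docx do not collide.
        relative = source.relative_to(input_dir)
        target = output_dir / relative.with_name(relative.name + ".md")
        try:
            result = converter.convert(str(source))
        except Exception as exc:
            # Record the failure and continue with the remaining files.
            failed.append((source, str(exc)))
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(result.text_content, encoding="utf-8")
        converted.append(target)
    return converted, failed
```

Collecting failures instead of raising lets one unreadable file be reported at the end without stopping the rest of the run.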