fix bugs and simplify
parent 6fc7da8899
commit 14734d3125
2 changed files with 111 additions and 302 deletions
20
CLAUDE.md
@@ -4,15 +4,19 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Project Overview

-Single-script Python tool that extracts credit card transactions from BAC Costa Rica statement PDFs. Parses section "B) Detalle de compras del periodo" and outputs JSON.
+Single-script Python tool that extracts credit card transactions from BAC Costa Rica statement PDFs. Parses sections B (purchases), D (other charges), and E (voluntary services) and outputs JSON.

 ## Dependencies

 - pdfplumber (>=0.10.0)

-## Usage
+## Commands

 ```bash
+# Run tests
+python testStatements/run_tests.py
+
+# Run extractor
 python bac_extract.py <pdf_file> [options]

 # Examples

@@ -29,11 +33,15 @@ Options:
 The extraction pipeline:
 1. Validates PDF is a BAC statement (`is_bac_statement`)
-2. Locates section B via regex patterns (`find_section_b_start`, `is_section_end`)
-3. Extracts tables page-by-page using pdfplumber
-4. Parses Spanish dates (D-MMM-YY format) and amounts with comma separators
+2. Iterates pages line-by-line, detecting section boundaries via `SECTIONS` dict patterns
+3. Parses transactions matching `TRANSACTION_PATTERN` regex
+4. Outputs card holders, transactions by section, and summaries

+Key data structures:
+- `SECTIONS`: Maps section IDs (B/D/E) to start/end regex patterns and output keys
+- `SPANISH_MONTHS`: Spanish month abbreviations for date parsing

 Key parsing functions:
 - `parse_spanish_date`: Converts "15-ENE-25" to "2025-01-15"
 - `parse_amount`: Handles "1,234.56" and trailing negatives "100.00-"
 - `extract_card_holder`: Matches "************1234 NAME" pattern
 - `matches_patterns`: Generic regex pattern matcher for section detection
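
For readers who want a concrete picture of the parsing functions listed above, here is a minimal sketch built only from the one-line descriptions in the diff. The function names come from CLAUDE.md, but the bodies, the contents of `SPANISH_MONTHS`, the `CARD_HOLDER_PATTERN` regex, and the return types are assumptions; the actual code in `bac_extract.py` may differ.

```python
import re
from datetime import date

# Assumed mapping; the real SPANISH_MONTHS in bac_extract.py may differ.
# Both "SEP" and "SET" are accepted for September as a precaution.
SPANISH_MONTHS = {
    "ENE": 1, "FEB": 2, "MAR": 3, "ABR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AGO": 8, "SEP": 9, "SET": 9, "OCT": 10, "NOV": 11, "DIC": 12,
}


def parse_spanish_date(text):
    """Convert a D-MMM-YY string such as "15-ENE-25" to ISO "2025-01-15"."""
    day, month_abbr, year = text.strip().split("-")
    month = SPANISH_MONTHS[month_abbr.upper()]
    # Assumption: every two-digit year belongs to the 2000s.
    return date(2000 + int(year), month, int(day)).isoformat()


def parse_amount(text):
    """Parse "1,234.56" and trailing-negative "100.00-" into a float."""
    cleaned = text.strip().replace(",", "")
    negative = cleaned.endswith("-")
    if negative:
        cleaned = cleaned[:-1]
    value = float(cleaned)
    return -value if negative else value


# Hypothetical pattern for lines shaped like "************1234 NAME".
CARD_HOLDER_PATTERN = re.compile(r"^\*+(\d{4})\s+(.+)$")


def extract_card_holder(line):
    """Return (last four digits, holder name), or None when the line does not match."""
    match = CARD_HOLDER_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2).strip()
```

Returning a float keeps the sketch short; for real monetary values `decimal.Decimal` would avoid rounding surprises.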
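
The section detection described in step 2 of the new pipeline could look roughly like the sketch below. Only the names `SECTIONS` and `matches_patterns` appear in the diff. The regex patterns, the output keys, the helper `group_lines_by_section`, and the sample lines are hypothetical and stand in for whatever `bac_extract.py` really uses.

```python
import re

# Hypothetical patterns and output keys; the real SECTIONS dict will differ.
SECTIONS = {
    "B": {"start": [r"^B\)\s"], "end": [r"^C\)\s"], "key": "purchases"},
    "D": {"start": [r"^D\)\s"], "end": [r"^E\)\s"], "key": "other_charges"},
    "E": {"start": [r"^E\)\s"], "end": [r"^F\)\s"], "key": "voluntary_services"},
}


def matches_patterns(line, patterns):
    """Return True if any regex in `patterns` matches somewhere in `line`."""
    return any(re.search(p, line, re.IGNORECASE) for p in patterns)


def group_lines_by_section(lines):
    """Walk the lines in order and collect those that fall inside each section."""
    result = {spec["key"]: [] for spec in SECTIONS.values()}
    current = None  # ID of the section we are inside, if any
    for line in lines:
        # Close the current section first, so a heading that ends one
        # section can also open the next one.
        if current is not None and matches_patterns(line, SECTIONS[current]["end"]):
            current = None
        started = next(
            (sid for sid, spec in SECTIONS.items()
             if matches_patterns(line, spec["start"])),
            None,
        )
        if started is not None:
            current = started
            continue  # the heading line itself is not a transaction
        if current is not None:
            result[SECTIONS[current]["key"]].append(line)
    return result


sample = [
    "B) Detalle de compras del periodo", "buy 1",
    "C) Pagos", "pay 1",
    "D) Otros cargos", "fee 1",
    "E) Servicios", "svc 1",
    "F) Fin", "tail",
]
grouped = group_lines_by_section(sample)
```

Keeping the start and end patterns in one dict means a new section needs a new entry rather than a new pair of functions, which fits the "simplify" goal in the commit title.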