Resume Parsing Algorithm: Technical Overview
This section describes the resume parsing algorithm developed by our organization. It outlines the four-step process used to extract structured data from single-column English-language resumes.
Step 1: Extract Text Items from PDF
The PDF format, standardized under ISO 32000, encodes content in a complex structure. To process a resume, our parser decodes the PDF using Mozilla's open-source pdf.js library to extract text items, including their content and metadata such as x, y coordinates, bold formatting, and line breaks.
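As a rough illustration of what one extracted text item might look like, the sketch below converts a raw item, modeled on the objects pdf.js returns from `getTextContent()`, into a flat record. The `TextItem` interface, the `toTextItem` helper, and the bold check on an already resolved font name are assumptions made for this example, not the parser's actual code.

```typescript
// Raw item shape, modeled on the text items pdf.js returns from getTextContent().
interface RawPdfItem {
  str: string;
  // Affine matrix [a, b, c, d, e, f]; e and f are the x and y translation.
  transform: [number, number, number, number, number, number];
  width: number;
  hasEOL: boolean;
}

// Hypothetical normalized shape used by the later steps.
interface TextItem {
  text: string;
  x: number; // measured from the bottom-left origin (0, 0)
  y: number;
  width: number;
  bold: boolean;
  newLine: boolean;
}

// Assumption: the caller has already resolved the item's real font name,
// and a name containing "bold" indicates bold formatting.
function toTextItem(raw: RawPdfItem, resolvedFontName: string): TextItem {
  return {
    text: raw.str,
    x: raw.transform[4],
    y: raw.transform[5],
    width: raw.width,
    bold: /bold/i.test(resolvedFontName),
    newLine: raw.hasEOL,
  };
}

const sample = toTextItem(
  { str: "John Doe", transform: [12, 0, 0, 12, 72, 720], width: 54, hasEOL: true },
  "Helvetica-Bold",
);
console.log(sample);
```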
The table below displays the text items extracted from the provided resume PDF. Each item includes metadata such as position (relative to the bottom-left corner at origin 0,0), boldness, and newline indicators.
Step 2: Group Text Items into Lines
Extracted text items require further processing to address two challenges:
Challenge 1: Fragmentation
Text items, such as phone numbers (e.g., "(123) 456-7890"), may be split into multiple fragments. To resolve this, adjacent items are merged if the horizontal distance between them is less than the average character width. Bolded text and newline elements are excluded when computing this average to keep the estimate accurate.
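The merge rule can be sketched as follows. This is a minimal example that assumes the average character width is the total width of the non-bold items divided by their total character count, and that the gap between two items is the distance from the right edge of one to the left edge of the next. The `TextItem` shape and both function names are hypothetical.

```typescript
interface TextItem {
  text: string;
  x: number; // left edge
  width: number; // rendered width
  bold: boolean;
}

// Assumption: average width = total width / total characters, over non-bold items.
// Empty items (such as bare newline elements) are skipped as well.
function averageCharWidth(items: TextItem[]): number {
  let totalWidth = 0;
  let totalChars = 0;
  for (const item of items) {
    if (item.bold || item.text.length === 0) continue;
    totalWidth += item.width;
    totalChars += item.text.length;
  }
  return totalChars === 0 ? 0 : totalWidth / totalChars;
}

// Merge left-to-right neighbors on one line whose gap is below the average width.
function mergeFragments(items: TextItem[]): TextItem[] {
  const threshold = averageCharWidth(items);
  const merged: TextItem[] = [];
  for (const item of items) {
    const prev = merged[merged.length - 1];
    const gap = prev ? item.x - (prev.x + prev.width) : Infinity;
    if (prev && gap < threshold) {
      prev.text += item.text;
      prev.width = item.x + item.width - prev.x; // extend to the new right edge
    } else {
      merged.push({ ...item }); // copy so the input array is not mutated
    }
  }
  return merged;
}

const fragments: TextItem[] = [
  { text: "(123)", x: 100, width: 30, bold: false },
  { text: " 456-", x: 131, width: 30, bold: false },
  { text: "7890", x: 162, width: 24, bold: false },
  { text: "john@example.com", x: 250, width: 96, bold: false },
];
const mergedItems = mergeFragments(fragments);
console.log(mergedItems.map((item) => item.text));
// → ["(123) 456-7890", "john@example.com"]
```

In this sample the average character width is 6 units, so the 1-unit gaps inside the phone number are merged while the 64-unit gap before the email address is kept.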
Challenge 2: Lack of Context
Raw text items lack the contextual associations humans infer from visual cues. Our parser groups items into lines, mimicking human reading patterns, to establish these relationships.
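One way to picture the line grouping is a single pass that closes the current line whenever an item carries a newline indicator. The sketch below assumes each item exposes a `newLine` flag and that whitespace-only items can be dropped; the `groupIntoLines` name and the `TextItem` shape are illustrative.

```typescript
interface TextItem {
  text: string;
  newLine: boolean; // true when the item ends its visual line
}

type Line = TextItem[];

function groupIntoLines(items: TextItem[]): Line[] {
  const lines: Line[] = [];
  let current: Line = [];
  for (const item of items) {
    // Assumption: whitespace-only items carry no content and are skipped.
    if (item.text.trim() !== "") current.push(item);
    if (item.newLine && current.length > 0) {
      lines.push(current);
      current = [];
    }
  }
  if (current.length > 0) lines.push(current); // flush a trailing line
  return lines;
}

const lines = groupIntoLines([
  { text: "John Doe", newLine: true },
  { text: "john@example.com", newLine: false },
  { text: "(123) 456-7890", newLine: true },
]);
// Items on the same line are joined with a vertical divider for display.
console.log(lines.map((line) => line.map((item) => item.text).join(" | ")));
// → ["John Doe", "john@example.com | (123) 456-7890"]
```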
The resulting lines are displayed below. Multiple text items within a line are separated by a vertical divider.
Step 3: Group Lines into Sections
Building on line grouping, this step organizes lines into sections to enhance contextual understanding. Most sections begin with a title, a common convention in resumes.
Section titles are identified using a primary heuristic requiring:
1. A single text item in the line
2. Bold formatting
3. All uppercase letters
A fallback heuristic uses keyword matching against common resume section titles if the primary criteria are not met.
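The two heuristics might be combined as in the sketch below. Several details are assumptions made for illustration: the keyword list, the rule that the fallback only applies to single-item lines of at most three words, and the choice to collect lines that precede the first title under a `profile` key.

```typescript
interface TextItem {
  text: string;
  bold: boolean;
}
type Line = TextItem[];

// Illustrative keyword list; the list the parser actually uses is not given here.
const SECTION_KEYWORDS = ["experience", "education", "project", "skill"];

const hasLetter = (text: string): boolean => /[a-zA-Z]/.test(text);

function isSectionTitle(line: Line): boolean {
  if (line.length !== 1) return false; // criterion 1: a single text item
  const { text, bold } = line[0];
  // Primary heuristic: bold and all uppercase.
  if (bold && hasLetter(text) && text === text.toUpperCase()) return true;
  // Fallback (assumed form): a short line containing a common section keyword.
  const lower = text.toLowerCase();
  const isShort = text.trim().split(/\s+/).length <= 3;
  return isShort && SECTION_KEYWORDS.some((keyword) => lower.includes(keyword));
}

function groupIntoSections(lines: Line[]): Record<string, Line[]> {
  // Assumption: lines before the first title belong to a "profile" section.
  let current = "profile";
  const sections: Record<string, Line[]> = { [current]: [] };
  for (const line of lines) {
    if (isSectionTitle(line)) {
      current = line[0].text;
      if (!(current in sections)) sections[current] = [];
    } else {
      sections[current].push(line);
    }
  }
  return sections;
}

const sections = groupIntoSections([
  [{ text: "John Doe", bold: true }],
  [{ text: "john@example.com", bold: false }, { text: "(123) 456-7890", bold: false }],
  [{ text: "EDUCATION", bold: true }],
  [{ text: "State University", bold: false }],
  [{ text: "Work Experience", bold: false }],
  [{ text: "Acme Corp", bold: false }],
]);
console.log(Object.keys(sections));
// → ["profile", "EDUCATION", "Work Experience"]
```

Here "EDUCATION" is caught by the primary heuristic, while "Work Experience" is neither bold nor uppercase and is only recognized through the keyword fallback.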
The table below shows identified sections, with titles in bold and associated lines highlighted in matching colors.
Step 4: Extract Resume Data from Sections
The final step extracts structured resume data using a feature-scoring system. Each resume attribute is evaluated against custom feature sets, which assign positive or negative scores based on matching criteria. The text item with the highest score is selected as the attribute value.
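The selection logic can be sketched as a sum over matching features followed by a maximum. The `Feature` shape, the function names, and the phone scores below are hypothetical; only the phone regex is taken from the core feature functions listed later in this section. Ties are resolved in favor of the earliest candidate, which is an assumption of this sketch.

```typescript
interface Feature {
  matches: (text: string) => boolean;
  score: number; // added to the total when `matches` returns true
}

// Sum the scores of every feature the text matches.
function scoreText(text: string, features: Feature[]): number {
  return features.reduce(
    (total, feature) => (feature.matches(text) ? total + feature.score : total),
    0,
  );
}

// Return the candidate with the highest total; the first one wins a tie.
function pickBest(candidates: string[], features: Feature[]): string | undefined {
  let best: string | undefined;
  let bestScore = -Infinity;
  for (const text of candidates) {
    const score = scoreText(text, features);
    if (score > bestScore) {
      best = text;
      bestScore = score;
    }
  }
  return best;
}

// Hypothetical phone feature set; the scores are illustrative.
const phoneFeatures: Feature[] = [
  { matches: (t) => /\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}/.test(t), score: 4 },
  { matches: (t) => /[a-zA-Z]/.test(t), score: -4 },
];

const bestPhone = pickBest(
  ["John Doe", "john@example.com", "(123) 456-7890"],
  phoneFeatures,
);
console.log(bestPhone); // → "(123) 456-7890"
```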
Feature Scoring System
The table below illustrates three attributes extracted from the profile section of the provided resume, showing the highest-scoring text and scores for other candidates.
Resume Attribute | Text (Highest Feature Score) | Feature Scores of Other Texts |
---|---|---|
Name | | |
Email | | |
Phone | | |
Feature Sets
Feature sets are crafted based on two principles:
1. Relative comparison to other attributes in the same section
2. Manual design reflecting attribute characteristics
The table below details the feature sets for the name attribute. Features characteristic of a name add a positive score, while features characteristic of another attribute (email, phone, address, or URL) subtract from it.
Name Feature Sets
Feature Function | Feature Matching Score |
---|---|
Contains only letters, spaces, or periods | +3 |
Is bolded | +2 |
Contains all uppercase letters | +2 |
Contains @ | -4 (email match) |
Contains number | -4 (phone match) |
Contains , | -4 (address match) |
Contains / | -4 (URL match) |
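The name feature sets in the table can be written out as data and applied to a few candidate items. This sketch copies the scores from the table; the `TextItem` and `Feature` shapes, the `nameScore` helper, and the reading of "all uppercase" as "every letter is uppercase" are assumptions.

```typescript
interface TextItem {
  text: string;
  bold: boolean;
}

interface Feature {
  matches: (item: TextItem) => boolean;
  score: number;
}

// Scores taken from the name feature table.
const nameFeatures: Feature[] = [
  { matches: (i) => /^[a-zA-Z\s\.]+$/.test(i.text), score: 3 }, // letters, spaces, periods
  { matches: (i) => i.bold, score: 2 },
  { matches: (i) => /[a-zA-Z]/.test(i.text) && i.text === i.text.toUpperCase(), score: 2 },
  { matches: (i) => i.text.includes("@"), score: -4 }, // looks like an email
  { matches: (i) => /\d/.test(i.text), score: -4 }, // looks like a phone number
  { matches: (i) => i.text.includes(","), score: -4 }, // looks like an address
  { matches: (i) => i.text.includes("/"), score: -4 }, // looks like a URL
];

function nameScore(item: TextItem): number {
  return nameFeatures.reduce(
    (total, feature) => (feature.matches(item) ? total + feature.score : total),
    0,
  );
}

const candidates: TextItem[] = [
  { text: "John Doe", bold: true },
  { text: "john@example.com", bold: false },
  { text: "(123) 456-7890", bold: false },
  { text: "Austin, TX", bold: false },
];
const nameScores = candidates.map((item) => nameScore(item));
console.log(nameScores); // → [5, -4, -4, -4]
```

"John Doe" earns +3 for containing only letters and spaces and +2 for being bold, while each of the other candidates triggers exactly one negative feature.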
Core Feature Functions
Each attribute relies on a core feature function for identification, as shown below.
Resume Attribute | Core Feature Function | Regex |
---|---|---|
Name | Contains only letters, spaces, or periods | /^[a-zA-Z\s\.]+$/ |
Email | Matches email format xxx@xxx.xxx, where xxx can be any non-space character | /\S+@\S+\.\S+/ |
Phone | Matches phone format (xxx)-xxx-xxxx, with optional parentheses and dashes | /\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}/ |
Location | Matches city and state format City, ST | /[A-Z][a-zA-Z\s]+, [A-Z]{2}/ |
URL | Matches URL format xxx.xxx/xxx | /\S+\.[a-z]+\/\S+/ |
School | Contains keywords like College, University, School | |
Degree | Contains keywords like Associate, Bachelor, Master | |
GPA | Matches GPA format x.xx | /[0-4]\.\d{1,2}/ |
Date | Contains year, month, season, or 'Present' keywords | Year: /(?:19|20)\d{2}/ |
Job Title | Contains keywords like Analyst, Engineer, Intern | |
Company | Is bolded or excludes job title/date patterns | |
Project | Is bolded or excludes date patterns | |
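To see how the regular expressions in the table behave, the sketch below runs each of them against a handful of sample strings. The regexes are copied from the table; the `coreRegex` map, the `matchingAttributes` helper, and the sample strings are made up for this example.

```typescript
// Regexes copied from the core feature function table.
const coreRegex: Record<string, RegExp> = {
  name: /^[a-zA-Z\s\.]+$/,
  email: /\S+@\S+\.\S+/,
  phone: /\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}/,
  location: /[A-Z][a-zA-Z\s]+, [A-Z]{2}/,
  url: /\S+\.[a-z]+\/\S+/,
  gpa: /[0-4]\.\d{1,2}/,
  year: /(?:19|20)\d{2}/,
};

// List every attribute whose core regex matches the given text.
function matchingAttributes(text: string): string[] {
  return Object.keys(coreRegex).filter((name) => coreRegex[name].test(text));
}

const samples = [
  "John Doe",
  "john@example.com",
  "(123) 456-7890",
  "Austin, TX",
  "linkedin.com/in/john-doe",
  "GPA: 3.85",
  "Jan 2020 - Present",
];
for (const text of samples) {
  console.log(text, "=>", matchingAttributes(text));
}
```

Each sample here matches exactly one core regex. In general a string can match more than one, so a core match feeds into the feature score rather than deciding the attribute by itself.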
Handling Subsections
For sections like education or work experience, which can contain several entries, subsections are detected using a heuristic: a new subsection starts where the vertical gap between two lines exceeds 1.4 times the typical line gap, or at bolded text. Each subsection is processed independently to extract attributes.
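A minimal sketch of the subsection heuristic follows. It assumes the typical gap is the most frequent rounded gap between consecutive lines, that y coordinates decrease down the page (origin at the bottom-left), and that bold lines are used as split points only when the gap rule finds no split. The `Line` shape and function names are hypothetical.

```typescript
interface Line {
  text: string;
  y: number; // baseline position; larger values are higher on the page
  bold: boolean;
}

// Assumption: the "typical" gap is the most frequent rounded gap between lines.
function typicalGap(lines: Line[]): number {
  const counts: Record<number, number> = {};
  let best = 0;
  let bestCount = 0;
  for (let i = 1; i < lines.length; i++) {
    const gap = Math.round(lines[i - 1].y - lines[i].y);
    counts[gap] = (counts[gap] || 0) + 1;
    if (counts[gap] > bestCount) {
      best = gap;
      bestCount = counts[gap];
    }
  }
  return best;
}

function splitSubsections(lines: Line[]): Line[][] {
  const threshold = typicalGap(lines) * 1.4;
  const split = (startsNew: (prev: Line, curr: Line) => boolean): Line[][] => {
    const groups: Line[][] = [];
    lines.forEach((line, i) => {
      if (i === 0 || startsNew(lines[i - 1], line)) groups.push([]);
      groups[groups.length - 1].push(line);
    });
    return groups;
  };
  // Primary rule: a gap larger than 1.4x the typical gap starts a new subsection.
  const byGap = split((prev, curr) => prev.y - curr.y > threshold);
  if (byGap.length > 1) return byGap;
  // Fallback (assumed ordering): start a new subsection at each bold line.
  return split((_prev, curr) => curr.bold);
}

const subsections = splitSubsections([
  { text: "State University", y: 700, bold: true },
  { text: "BS Computer Science", y: 688, bold: false },
  { text: "GPA 3.85", y: 676, bold: false },
  { text: "City College", y: 656, bold: true },
  { text: "Associate of Arts", y: 644, bold: false },
]);
console.log(subsections.map((group) => group.map((line) => line.text)));
```

In this sample the typical gap is 12 units, so the threshold is 16.8 and the 20-unit gap before "City College" starts a second subsection.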
Authored by Farouk Jjingo, January 2025