feat: add support for partitioning .heic files #2454

Merged
merged 12 commits into from
Jan 30, 2024

Collaborator

@Coniferish Coniferish commented Jan 24, 2024

.heic files are an image file type we have not previously supported.

Testing

from unstructured.partition.image import partition_image

png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"

png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")

# Each line should print True if the HEIC text matches the PNG text.
for heic_element, png_element in zip(heic_elements, png_elements):
    print(heic_element.text == png_element.text)
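The element-by-element check above can miss the case where one format yields more elements than the other. A minimal sketch of a stricter comparison, assuming the element texts have already been pulled into plain lists of strings; `diff_texts` and the sample lists are hypothetical and not part of unstructured:

```python
from itertools import zip_longest


def diff_texts(expected, actual):
    """Return (index, expected, actual) for every position where the texts differ.

    A missing entry on either side shows up as None, so a length
    mismatch is reported instead of being silently ignored.
    """
    mismatches = []
    for index, (exp, act) in enumerate(zip_longest(expected, actual)):
        if exp != act:
            mismatches.append((index, exp, act))
    return mismatches


# Hypothetical sample data standing in for [el.text for el in png_elements], etc.
png_texts = ["Title", "Body text", "Footer"]
heic_texts = ["Title", "Body text"]

print(diff_texts(png_texts, heic_texts))  # [(2, 'Footer', None)]
```

An empty result means both formats produced the same texts in the same order.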

Contributor

@qued qued left a comment

Wait for @awalker4 to chime in about the unstructured-client bump, and I just have the one concern about using a different Python version to pip-compile. Otherwise LGTM! Thanks for the quick win on this!

Collaborator

@christinestraub christinestraub left a comment

@Coniferish @qued I found that if we call register_heif_opener() before calling unstructured-inference functions, this PR works without making any changes to the unstructured-inference repo.
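For reference, a minimal sketch of what registering the opener up front could look like. It assumes `register_heif_opener` is the one from the `pillow_heif` package; `enable_heic_support` is a hypothetical helper and not a function in unstructured:

```python
def enable_heic_support() -> bool:
    """Register the HEIF opener with Pillow, if pillow_heif is installed.

    Returns True when the opener was registered and False when
    pillow_heif is not available in the current environment.
    """
    try:
        from pillow_heif import register_heif_opener
    except ImportError:
        return False
    # After this call, PIL.Image.open() can read HEIF/HEIC files.
    register_heif_opener()
    return True


if enable_heic_support():
    from PIL import Image

    # Inspect whether Pillow now lists the .heic extension.
    print(".heic" in Image.registered_extensions())
else:
    print("pillow_heif is not installed; .heic files cannot be opened")
```

Because registration happens on the Pillow side, any later code that opens images through `PIL.Image.open` should pick up HEIC support without changes of its own.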

@christinestraub
Collaborator

Can we move register_heif_opener() to partition_pdf_or_image() so that the "ocr_only" strategy works too?

heic_elements = partition_image(heic_filename, strategy="ocr_only")

Collaborator

@christinestraub christinestraub left a comment

LGTM!

@christinestraub christinestraub added this pull request to the merge queue Jan 30, 2024
Merged via the queue into main with commit db67805 Jan 30, 2024
43 checks passed
@christinestraub christinestraub deleted the jj/heic branch January 30, 2024 05:41
4 participants