Chore (refactor): support table extraction with pre-computed ocr data #1801
Merged
yuming-long commented on Oct 20, 2023:
"metadata": { | ||
"data_source": {}, | ||
"filetype": "image/jpeg", | ||
"page_number": 1 | ||
}, | ||
"text": "layout data structures, which are optimized for efficiency and versatility. 3) When necessary, users can employ existing or customized OCR models via the unified API provided in the OCR module. 4) LayoutParser comes with a set of utility fanctions for the visualization and storage of the layout data. 5) LayoutParser is also highly customizable, via its integration with functions for layout data annotation and model training We now provide detailed descriptions for each component." | ||
"text": "For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Fuster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months." |
Scaling the entire image improved the ingest output; small-font texts are now detected.
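For context on why scaling helps: upscaling the page image before OCR gives the engine more pixels per glyph. A minimal sketch of that idea, assuming Pillow and a made-up `zoom` factor (not the repo's actual implementation):

```python
from PIL import Image


def zoom_image(image: Image.Image, zoom: float = 3.0) -> Image.Image:
    """Upscale the whole page image before OCR so small-font text is resolvable."""
    if zoom <= 0:
        zoom = 1.0
    new_size = (int(image.width * zoom), int(image.height * zoom))
    return image.resize(new_size, resample=Image.LANCZOS)
```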
- The function wrapper tried to use `cast` to convert kwargs into `str`, but when a value is `None`, `cast(str, None)` still returns `None`. The fix replaces the conversion with a plain `str()` call.
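The subtlety is that `typing.cast` is a no-op at runtime and only informs the type checker; a quick generic illustration of the difference:

```python
from typing import cast

value = None

# cast() does not convert anything at runtime; the object comes back unchanged.
assert cast(str, value) is None

# str() performs an actual conversion, so callers always receive a string.
assert str(value) == "None"
assert str(42) == "42"
```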
github-merge-queue bot pushed a commit that referenced this pull request on Oct 24, 2023:
…sion on entire doc OCR (#1850)

### Summary
A follow-up ticket on #1801: I forgot to remove the lines that pass `extract_tables` to inference, and noted the table regression if we only do one OCR for the entire doc.

**Tech details:**
* Stop passing the `extract_tables` parameter to inference.
* Added a table extraction ingest test for images, which was skipped before; the "text_as_html" field contains the OCR output from the table OCR refactor PR.
* Replaced `assert_called_once_with` with `call_args` so that the unit tests don't need to test additional parameters.
* Added `error_margin` as an ENV when comparing bounding boxes of `ocr_region` with `table_element`.
* Added more tests for tables and noted the table regression in the test for partition pdf.

### Test
For "stop passing the `extract_tables` parameter to inference": run the test `test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this branch and you will see a warning like `Table OCR from get_tokens method will be deprecated....`, which means it called the table OCR in the inference repo. This branch removes the warning.
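The `call_args` change mentioned in the commit above is a common mock pattern: assert only on the arguments the test cares about, so adding new optional parameters to the callee does not break existing tests. A generic sketch (the names here are illustrative, not the repo's tests):

```python
from unittest.mock import MagicMock

# Stand-in for the downstream function that partitioning calls into.
mock_process = MagicMock(return_value=[])
mock_process(filename="fake.pdf", ocr_languages="eng", pre_computed_ocr=True)

# assert_called_once_with would have to spell out every parameter, including
# ones this test does not care about:
#   mock_process.assert_called_once_with(
#       filename="fake.pdf", ocr_languages="eng", pre_computed_ocr=True
#   )

# call_args lets the test check only the relevant subset of kwargs.
_, kwargs = mock_process.call_args
assert kwargs["filename"] == "fake.pdf"
assert kwargs["ocr_languages"] == "eng"
```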
### Summary
Table OCR refactor: move the OCR part for the table model from the inference repo to the unstructured repo.
### Tech details
* Consolidated `ENTIRE_PAGE_OCR` and `TABLE_OCR` into a single `OCR_AGENT` setting (see the sketch after this list); this means we use the same OCR agent for the entire page and for tables, since we only do one OCR pass.
* Bumped inference to `0.7.9`, which allows the table model in inference to use pre-computed OCR data from the unstructured repo. Please check in the PR.
* `make tidy`
* `test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages`
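A hedged sketch of what consolidating on a single `OCR_AGENT` environment variable can look like from the calling side; the helper name, default value, and supported engine names are assumptions rather than the repo's exact code:

```python
import os

# One environment variable now selects the OCR engine for both the full-page
# pass and table OCR, since only one OCR pass is performed per page.
OCR_AGENT = os.environ.get("OCR_AGENT", "tesseract")


def get_ocr_agent() -> str:
    """Return the OCR engine name shared by entire-page and table OCR."""
    agent = OCR_AGENT.lower()
    if agent not in ("tesseract", "paddle"):
        raise ValueError(f"Unsupported OCR_AGENT: {agent}")
    return agent


if __name__ == "__main__":
    print(get_ocr_agent())
```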
### Test
* `test_partition_image_with_table_extraction`: screenshot for the table in `layout-parser-paper-with-table.jpg`: [screenshot]
* before refactor: [screenshot]
* after refactor: [screenshot]
### TODO
(added as a ticket) There is still some cleanup to do in the inference repo, since the unstructured repo now has duplicate logic, but we can keep it as a fallback plan. If we want to remove anything OCR-related in inference, here are the items that are deprecated and can be removed:
* `get_tokens` (already noted in code)
* `extract_tables` in inference
* `interpret_table_block`
* `load_agent`
* `TABLE_OCR`
### Note
If we want to fall back to an additional table OCR pass (we may need this to use Paddle for tables), we need to (sketched below):
* pass `infer_table_structure` to inference with the `extract_tables` parameter
* pass `infer_table_structure` to `ocr.py`
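A rough, self-contained sketch of the fallback plumbing this note describes; every function here is a stand-in assumed for illustration, not the actual unstructured or inference API:

```python
def process_with_inference(filename: str, extract_tables: bool) -> dict:
    """Stand-in for the inference entry point; extract_tables would re-enable
    the old in-inference table OCR path (e.g. to use Paddle for tables)."""
    return {"filename": filename, "elements": [{"type": "Table", "text_as_html": None}]}


def supplement_with_ocr(layout: dict, infer_table_structure: bool) -> dict:
    """Stand-in for ocr.py: only produce table HTML when the caller asked for it."""
    if infer_table_structure:
        for element in layout["elements"]:
            if element["type"] == "Table":
                element["text_as_html"] = "<table>...</table>"
    return layout


def partition(filename: str, infer_table_structure: bool = False) -> dict:
    # Fallback plan: forward infer_table_structure both to inference (as the
    # extract_tables parameter) and to the OCR module.
    layout = process_with_inference(filename, extract_tables=infer_table_structure)
    return supplement_with_ocr(layout, infer_table_structure=infer_table_structure)


if __name__ == "__main__":
    print(partition("layout-parser-paper-with-table.jpg", infer_table_structure=True))
```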