
Chore (refactor): support table extraction with pre-computed ocr data #1801

Merged · 51 commits merged into main on Oct 21, 2023

Conversation

@yuming-long (Contributor) commented Oct 19, 2023

Summary

Table OCR refactor: move the OCR step for the table model from the inference repo to the unst repo.

  • Before this PR, the table model extracted OCR tokens (texts and bounding boxes) itself and filled those tokens into the table structure in the inference repo. This meant we had to run an additional OCR pass for tables.
  • After this PR, we reuse the OCR data from the entire-page OCR and pass those tokens to the inference repo, which means we only run OCR once for the entire document (see the sketch below).
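A minimal sketch of the new data flow, assuming a simple token representation; the names `PageOCRToken` and `tokens_in_region` are illustrative, not the actual unstructured/inference API:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageOCRToken:
    """One OCR token from the single page-level OCR pass (hypothetical shape)."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in page coordinates


def tokens_in_region(
    tokens: List[PageOCRToken],
    region: Tuple[float, float, float, float],
) -> List[PageOCRToken]:
    """Reuse page-level tokens for a table: keep the tokens whose center falls
    inside the table's bounding box instead of running a second, table-only OCR."""
    rx1, ry1, rx2, ry2 = region
    selected = []
    for token in tokens:
        x1, y1, x2, y2 = token.bbox
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if rx1 <= cx <= rx2 and ry1 <= cy <= ry2:
            selected.append(token)
    return selected


# The filtered tokens are then handed to the table model in the inference repo
# (e.g. as an ocr_tokens-style argument; the parameter name here is assumed)
# instead of letting the table model OCR the cropped table image itself.
```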

Tech details:

  • Combined the ENTIRE_PAGE_OCR and TABLE_OCR environment variables into OCR_AGENT: since we only run OCR once, the same OCR agent is used for the entire page and for tables (see the sketch after this list).
  • Bumped the inference repo to 0.7.9, which allows the table model in inference to use pre-computed OCR data from the unst repo; please check the corresponding PR.
  • All notebook lint changes were made by make tidy
  • This PR also fixes an issue; I've added a test for it in test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages
  • Added the same scaling logic to the image as the previous table OCR, but the scaling is now applied to the entire image
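A minimal sketch of the consolidated agent selection, assuming a single OCR_AGENT environment variable; the stub functions, agent names, and default value below are illustrative rather than the library's actual implementation:

```python
import os
from typing import Callable, List


def _tesseract_agent(image_path: str) -> List[dict]:
    """Placeholder for a Tesseract-backed OCR call (assumed, not the real wrapper)."""
    raise NotImplementedError


def _paddle_agent(image_path: str) -> List[dict]:
    """Placeholder for a Paddle-backed OCR call (assumed, not the real wrapper)."""
    raise NotImplementedError


# Before this PR, ENTIRE_PAGE_OCR and TABLE_OCR could name different agents; now a
# single OCR_AGENT value drives the one OCR pass shared by the page and its tables.
_AGENTS = {"tesseract": _tesseract_agent, "paddle": _paddle_agent}


def get_ocr_agent() -> Callable[[str], List[dict]]:
    """Resolve the shared OCR agent from the OCR_AGENT environment variable."""
    name = os.environ.get("OCR_AGENT", "tesseract").lower()
    try:
        return _AGENTS[name]
    except KeyError:
        raise ValueError(f"Unrecognized OCR_AGENT: {name!r}") from None
```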

Test

  • Not much to test manually except confirming that table extraction still works
  • Due to the change in scaling and the use of pre-computed OCR data from the entire page, there are some slight (better) changes in the table output. Here is a comparison of the outputs I found from the same test, test_partition_image_with_table_extraction:

Screenshots of the table in layout-parser-paper-with-table.jpg: expected output, output before the refactor, and output after the refactor.

TODO

(added as a ticket) There is still some cleanup to do in the inference repo, since the unst repo now has duplicate logic, but we can keep it as a fallback plan. If we want to remove anything OCR-related in inference, here are the items that are deprecated and can be removed:

Note

If we want to fall back to an additional table-specific OCR pass (we may need this in order to use Paddle for tables), we need to (a rough sketch follows this list):

  • pass infer_table_structure to inference with the extract_tables parameter
  • stop passing infer_table_structure to ocr.py
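A rough sketch of that fallback wiring; the function names and signatures below are stand-ins for the real entry points, not the actual API:

```python
def process_file_with_inference(filename: str, extract_tables: bool = False):
    """Stand-in for the inference entry point (assumed name and signature)."""
    ...


def supplement_with_ocr(layout):
    """Stand-in for the page-level OCR step in ocr.py (assumed name)."""
    return layout


def partition_hi_res(filename: str, infer_table_structure: bool = False):
    # Fallback plan: forward the flag to inference as `extract_tables`, so
    # inference can run its own table OCR again when needed ...
    layout = process_file_with_inference(filename, extract_tables=infer_table_structure)
    # ... and stop forwarding infer_table_structure to ocr.py.
    return supplement_with_ocr(layout)
```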

"metadata": {
"data_source": {},
"filetype": "image/jpeg",
"page_number": 1
},
"text": "layout data structures, which are optimized for efficiency and versatility. 3) When necessary, users can employ existing or customized OCR models via the unified API provided in the OCR module. 4) LayoutParser comes with a set of utility fanctions for the visualization and storage of the layout data. 5) LayoutParser is also highly customizable, via its integration with functions for layout data annotation and model training We now provide detailed descriptions for each component."
"text": "For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Fuster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months."
@yuming-long (Contributor, Author) commented:
Scaling on the entire image improved the ingest output; small-font texts are now detected.

yuming-long and others added 5 commits October 20, 2023 17:22
- the function wrapper tries to use `cast` to convert kwargs into `str`, but
  when a value is `None`, `cast(str, None)` still returns `None`
- the fix replaces the conversion with a plain `str()` call (see the sketch below)
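A minimal illustration of why the `cast` approach fails: `typing.cast` is a no-op at runtime and never converts the value, while `str()` performs a real conversion.

```python
from typing import cast

value = None

# typing.cast only informs the type checker; at runtime it returns its
# argument unchanged, so None stays None.
as_cast = cast(str, value)
print(repr(as_cast), type(as_cast))  # None <class 'NoneType'>

# str() actually converts the value, so downstream code always gets a string.
as_str = str(value)
print(repr(as_str), type(as_str))    # 'None' <class 'str'>
```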
@yuming-long enabled auto-merge October 20, 2023 22:12
…ks-api' into yuming/table_ocr_factor"

This reverts commit 65c13bf, reversing
changes made to 03da255.
@yuming-long disabled auto-merge October 20, 2023 22:17
@yuming-long enabled auto-merge October 20, 2023 22:40
@yuming-long added this pull request to the merge queue Oct 20, 2023
@yuming-long removed this pull request from the merge queue due to a manual request Oct 20, 2023
@yuming-long enabled auto-merge October 20, 2023 23:51
@yuming-long added this pull request to the merge queue Oct 21, 2023
Merged via the queue into main with commit ce40cdc Oct 21, 2023
@yuming-long deleted the yuming/table_ocr_factor branch October 21, 2023 00:59
github-merge-queue bot pushed a commit that referenced this pull request Oct 24, 2023
…sion on entire doc OCR (#1850)

### Summary

A follow-up ticket on #1801: I forgot to remove the lines that pass extract_tables to inference, and noted the table regression when we only run one OCR pass for the entire doc.

**Tech details:**
* stop passing the `extract_tables` parameter to inference
* added a table extraction ingest test for images, which was skipped before; the "text_as_html" field contains the OCR output from the table OCR refactor PR
* replaced `assert_called_once_with` with `call_args` so that the unit tests don't need to check additional parameters
* added `error_margin` as an ENV variable used when comparing the bounding boxes of `ocr_region` with `table_element` (see the sketch after this list)
* added more tests for tables and noted the table regression in the partition-pdf tests
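A minimal sketch of a tolerance-based bounding-box containment check in the spirit of that change, assuming simple (x1, y1, x2, y2) boxes; the env-variable name, default value, and function name below are illustrative assumptions:

```python
import os
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Tolerance (in pixels) applied when deciding whether an OCR region belongs to a
# table element; the env-variable name and default here are assumptions.
ERROR_MARGIN = float(os.environ.get("OCR_BBOX_ERROR_MARGIN", "0.0"))


def region_in_element(ocr_region: Box, table_element: Box, margin: float = ERROR_MARGIN) -> bool:
    """Return True if ocr_region lies inside table_element expanded by `margin`."""
    ox1, oy1, ox2, oy2 = ocr_region
    tx1, ty1, tx2, ty2 = table_element
    return (
        ox1 >= tx1 - margin
        and oy1 >= ty1 - margin
        and ox2 <= tx2 + margin
        and oy2 <= ty2 + margin
    )
```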

### Test
* for no longer passing the `extract_tables` parameter to inference: run the test
`test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this branch and
you will see a warning like `Table OCR from get_tokens method will be
deprecated....`, which means the table OCR in the inference repo was called.
This branch removes that warning.