
Chore (refactor): support table extraction with pre-computed ocr data #1801

Merged · 51 commits merged into main on Oct 21, 2023

Conversation

@yuming-long (Contributor) commented Oct 19, 2023

Summary

Table OCR refactor: move the OCR step for the table model from the inference repo to the unst repo.

  • Before this PR, the table model extracted OCR tokens (texts and bounding boxes) itself and filled those tokens into the table structure in the inference repo. This meant we had to run an additional OCR pass for tables.
  • After this PR, we reuse the OCR data from the entire-page OCR and pass those tokens to the inference repo, which means we only run OCR once for the entire document (see the sketch below).
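A minimal sketch of the new data flow, assuming a simple token representation; the names `PageOCRToken` and `tokens_in_region` are illustrative, not the actual unstructured/inference API:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageOCRToken:
    """One OCR token from the single page-level OCR pass (hypothetical shape)."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in page coordinates


def tokens_in_region(
    tokens: List[PageOCRToken],
    region: Tuple[float, float, float, float],
) -> List[PageOCRToken]:
    """Reuse page-level tokens for a table: keep the tokens whose center falls
    inside the table's bounding box instead of running a second, table-only OCR."""
    rx1, ry1, rx2, ry2 = region
    selected = []
    for token in tokens:
        x1, y1, x2, y2 = token.bbox
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if rx1 <= cx <= rx2 and ry1 <= cy <= ry2:
            selected.append(token)
    return selected


# The filtered tokens are then handed to the table model in the inference repo
# (e.g. as an ocr_tokens-style argument; the parameter name here is assumed)
# instead of letting the table model OCR the cropped table image itself.
```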

Tech details:

  • Combined the ENTIRE_PAGE_OCR and TABLE_OCR environment variables into OCR_AGENT: since we only run OCR once, the same OCR agent is used for the entire page and for tables (see the sketch after this list).
  • Bumped the inference repo to 0.7.9, which allows the table model in inference to use pre-computed OCR data from the unst repo; please check the corresponding PR.
  • All notebook lint changes were made by make tidy
  • This PR also fixes an issue; I've added a test for it in test_pdf.py::test_partition_pdf_hi_table_extraction_with_languages
  • Added the same scaling logic to the image as the previous table OCR, but the scaling is now applied to the entire image
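A minimal sketch of the consolidated agent selection, assuming a single OCR_AGENT environment variable; the stub functions, agent names, and default value below are illustrative rather than the library's actual implementation:

```python
import os
from typing import Callable, List


def _tesseract_agent(image_path: str) -> List[dict]:
    """Placeholder for a Tesseract-backed OCR call (assumed, not the real wrapper)."""
    raise NotImplementedError


def _paddle_agent(image_path: str) -> List[dict]:
    """Placeholder for a Paddle-backed OCR call (assumed, not the real wrapper)."""
    raise NotImplementedError


# Before this PR, ENTIRE_PAGE_OCR and TABLE_OCR could name different agents; now a
# single OCR_AGENT value drives the one OCR pass shared by the page and its tables.
_AGENTS = {"tesseract": _tesseract_agent, "paddle": _paddle_agent}


def get_ocr_agent() -> Callable[[str], List[dict]]:
    """Resolve the shared OCR agent from the OCR_AGENT environment variable."""
    name = os.environ.get("OCR_AGENT", "tesseract").lower()
    try:
        return _AGENTS[name]
    except KeyError:
        raise ValueError(f"Unrecognized OCR_AGENT: {name!r}") from None
```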

Test

  • Not much to test manually except confirming that table extraction still works
  • Due to the change in scaling and the use of pre-computed OCR data from the entire page, there are some slight (better) changes in the table output. Here is a comparison of the outputs I found from the same test, test_partition_image_with_table_extraction:

Screenshots of the table in layout-parser-paper-with-table.jpg: expected output, output before the refactor, and output after the refactor.

TODO

(added as a ticket) There is still some cleanup to do in the inference repo, since the unst repo now has duplicate logic, but we can keep it as a fallback plan. If we want to remove anything OCR-related in inference, here are the items that are deprecated and can be removed:

Note

If we want to fall back to an additional table-specific OCR pass (we may need this in order to use Paddle for tables), we need to (a rough sketch follows this list):

  • pass infer_table_structure to inference with the extract_tables parameter
  • stop passing infer_table_structure to ocr.py
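A rough sketch of that fallback wiring; the function names and signatures below are stand-ins for the real entry points, not the actual API:

```python
def process_file_with_inference(filename: str, extract_tables: bool = False):
    """Stand-in for the inference entry point (assumed name and signature)."""
    ...


def supplement_with_ocr(layout):
    """Stand-in for the page-level OCR step in ocr.py (assumed name)."""
    return layout


def partition_hi_res(filename: str, infer_table_structure: bool = False):
    # Fallback plan: forward the flag to inference as `extract_tables`, so
    # inference can run its own table OCR again when needed ...
    layout = process_file_with_inference(filename, extract_tables=infer_table_structure)
    # ... and stop forwarding infer_table_structure to ocr.py.
    return supplement_with_ocr(layout)
```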

"metadata": {
"data_source": {},
"filetype": "image/jpeg",
"page_number": 1
},
"text": "layout data structures, which are optimized for efficiency and versatility. 3) When necessary, users can employ existing or customized OCR models via the unified API provided in the OCR module. 4) LayoutParser comes with a set of utility fanctions for the visualization and storage of the layout data. 5) LayoutParser is also highly customizable, via its integration with functions for layout data annotation and model training We now provide detailed descriptions for each component."
"text": "For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101 backbones [13], respectively. One can train models of different architectures, like Fuster R-CNN [28] (P) and Mask R-CNN [12] (M). For example, an F in the Large Model column indicates it has m Faster R-CNN model trained using the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model zoo in coming months."
@yuming-long (Contributor, Author) commented:
Scaling on the entire image improved the ingest output; small-font texts are now detected.

yuming-long and others added 5 commits October 20, 2023 17:22
- the function wrapper tries to use `cast` to convert kwargs into `str`, but
  when a value is `None`, `cast(str, None)` still returns `None`
- the fix replaces the conversion with a plain `str()` call (see the sketch below)
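A minimal illustration of why the `cast` approach fails: `typing.cast` is a no-op at runtime and never converts the value, while `str()` performs a real conversion.

```python
from typing import cast

value = None

# typing.cast only informs the type checker; at runtime it returns its
# argument unchanged, so None stays None.
as_cast = cast(str, value)
print(repr(as_cast), type(as_cast))  # None <class 'NoneType'>

# str() actually converts the value, so downstream code always gets a string.
as_str = str(value)
print(repr(as_str), type(as_str))    # 'None' <class 'str'>
```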
@yuming-long enabled auto-merge October 20, 2023 22:12
…ks-api' into yuming/table_ocr_factor"

This reverts commit 65c13bf, reversing
changes made to 03da255.
@yuming-long disabled auto-merge October 20, 2023 22:17
@yuming-long enabled auto-merge October 20, 2023 22:40
@yuming-long added this pull request to the merge queue Oct 20, 2023
@yuming-long removed this pull request from the merge queue due to a manual request Oct 20, 2023
@yuming-long enabled auto-merge October 20, 2023 23:51
@yuming-long added this pull request to the merge queue Oct 21, 2023
Merged via the queue into main with commit ce40cdc Oct 21, 2023
@yuming-long deleted the yuming/table_ocr_factor branch October 21, 2023 00:59
github-merge-queue bot pushed a commit that referenced this pull request Oct 24, 2023
…sion on entire doc OCR (#1850)

### Summary

A follow-up ticket on #1801: I forgot to remove the lines that pass extract_tables to inference, and noted the table regression when we only run one OCR pass for the entire doc.

**Tech details:**
* stop passing the `extract_tables` parameter to inference
* added a table extraction ingest test for images, which was skipped before; the "text_as_html" field contains the OCR output from the table OCR refactor PR
* replaced `assert_called_once_with` with `call_args` so that the unit tests don't need to check additional parameters
* added `error_margin` as an ENV variable used when comparing the bounding boxes of `ocr_region` with `table_element` (see the sketch after this list)
* added more tests for tables and noted the table regression in the partition-pdf tests
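A minimal sketch of a tolerance-based bounding-box containment check in the spirit of that change, assuming simple (x1, y1, x2, y2) boxes; the env-variable name, default value, and function name below are illustrative assumptions:

```python
import os
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Tolerance (in pixels) applied when deciding whether an OCR region belongs to a
# table element; the env-variable name and default here are assumptions.
ERROR_MARGIN = float(os.environ.get("OCR_BBOX_ERROR_MARGIN", "0.0"))


def region_in_element(ocr_region: Box, table_element: Box, margin: float = ERROR_MARGIN) -> bool:
    """Return True if ocr_region lies inside table_element expanded by `margin`."""
    ox1, oy1, ox2, oy2 = ocr_region
    tx1, ty1, tx2, ty2 = table_element
    return (
        ox1 >= tx1 - margin
        and oy1 >= ty1 - margin
        and ox2 <= tx2 + margin
        and oy2 <= ty2 + margin
    )
```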

### Test
* for no longer passing the `extract_tables` parameter to inference: run the test
`test_partition_pdf_hi_res_ocr_mode_with_table_extraction` before this branch and
you will see a warning like `Table OCR from get_tokens method will be
deprecated....`, which means the table OCR in the inference repo was called.
This branch removes that warning.