Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(chunking): add overlap on chunk-splits #2305

Merged
merged 1 commit into from
Dec 22, 2023
Merged

Conversation

scanny
Copy link
Collaborator

@scanny scanny commented Dec 20, 2023

There are two distinct overlap operations with completely different implementations. This is "intra-chunk" overlap, applying overlap to chunks resulting from text-splitting an oversized element.

So if an oversized element had text "abcd efgh ijkl mnop qrst" and was split at 15 chars with overlap of 5, it would produce "abcd efgh ijkl" and "ijkl mnop qrst". Any inter-chunk overlap from the prior chunk and applied at the beginning of the string (before "abcd") is handled in a separate operation in the next PR.

Base automatically changed from scanny/split-on-word-boundary to main December 21, 2023 06:28
@scanny scanny force-pushed the scanny/intra-chunk-overlap branch 3 times, most recently from 28cbf5c to f221528 Compare December 21, 2023 18:11
@scanny scanny marked this pull request as ready for review December 21, 2023 18:58
@christinestraub
Copy link
Collaborator

christinestraub commented Dec 22, 2023

Is the final goal of a series of refactoring PRs to implement the "intra-chunking" overlap? This PR seems to introduce a new concept of "intra-chunking" overlap so, it would probably be good to describe this "intra-chunking" overlap in the CHANGELOG file at this stage. What do you think?

@scanny
Copy link
Collaborator Author

scanny commented Dec 22, 2023

Yeah, I thought about that, but my thinking is that none of this is exposed yet (no way to engage these features from the unstructured API) so I thought it would be best to put the CHANGELOG updates in the later PR where these features are added to the API, "wired up" so to speak.

Otherwise someone reading the CHANGELOG might go looking for something they're not going to be able to find.

There's one more PR after this one to add the inter-chunk overlap capability, and then the next one (the last one) will actually add the chunk_by_character() strategy and add the overlap parameters to the decorator, the strategies, and probably in the documentation somewhere. I was thinking to put the feature additions on the CHANGELOG for that PR since that's when they'll be "published".

@scanny
Copy link
Collaborator Author

scanny commented Dec 22, 2023

Thanks @christinestraub! :)

@scanny scanny added this pull request to the merge queue Dec 22, 2023
@cragwolfe cragwolfe removed this pull request from the merge queue due to a manual request Dec 22, 2023
@cragwolfe
Copy link
Contributor

cragwolfe commented Dec 22, 2023

Otherwise someone reading the CHANGELOG might go looking for something they're not going to be able to find.

I think the CHANGELOG should reflect the state of the actual library. Historically, new functionality appears in the CHANGELOG necessarily shows up first as plain-old-partition parameters.

I do think it makes sense to have an additional CHANGELOG item later when unstructured-ingest supports, and at that point the API should too.

@scanny
Copy link
Collaborator Author

scanny commented Dec 22, 2023

@cragwolfe This intra-chunk overlap behavior does not appear as partition parameters yet. The only way to trigger the behavior is by constructing a (lib internal-use only) ChunkingOptions instance with overlap > 0. That (new-ish) object is not an interface object, it's only used (outside of tests) by chunking-strategy code, chunk_by_title() so far.

Before the user can specify overlap, we have to:

  1. Add it as a partition kwarg, in the documentation at least.
  2. Add capture and use of that kwarg to the @apply_chunking_strategy decorator.
  3. Add it as an explicit kwarg to chunk_by_title() and the new chunking strategy being added.

My plan was to postpone those bits until after the next PR which adds the second half of overlap and "wire-up" the new overlap args and announce the behavior together.

I'm happy to "pre-announce" it in CHANGELOG if that's how we do it, just let me know. I just figured we were focusing CHANGELOG on changes visible to the end-user rather than when we "laid the foundation" for those features, so to speak.

There are two distinct overlap operations with completely different
implementations. This is "intra-chunk" overlap, applying overlap to
chunks resulting from text-splitting an oversized element.
@scanny
Copy link
Collaborator Author

scanny commented Dec 22, 2023

Ok, in the interest of time I've added a CHANGELOG entry that describes the new behavior but notes that it is not yet available from the interface.

@scanny scanny added this pull request to the merge queue Dec 22, 2023
Merged via the queue into main with commit eb1b022 Dec 22, 2023
@scanny scanny deleted the scanny/intra-chunk-overlap branch December 22, 2023 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants