-
Notifications
You must be signed in to change notification settings - Fork 874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(chunking): add overlap on chunk-splits #2305
Conversation
28cbf5c
to
f221528
Compare
Is the final goal of a series of refactoring PRs to implement the "intra-chunking" overlap? This PR seems to introduce a new concept of "intra-chunking" overlap so, it would probably be good to describe this "intra-chunking" overlap in the CHANGELOG file at this stage. What do you think? |
Yeah, I thought about that, but my thinking is that none of this is exposed yet (no way to engage these features from the unstructured API) so I thought it would be best to put the CHANGELOG updates in the later PR where these features are added to the API, "wired up" so to speak. Otherwise someone reading the CHANGELOG might go looking for something they're not going to be able to find. There's one more PR after this one to add the inter-chunk overlap capability, and then the next one (the last one) will actually add the |
Thanks @christinestraub! :) |
I think the CHANGELOG should reflect the state of the actual library. Historically, new functionality appears in the CHANGELOG necessarily shows up first as plain-old-partition parameters. I do think it makes sense to have an additional CHANGELOG item later when unstructured-ingest supports, and at that point the API should too. |
@cragwolfe This intra-chunk overlap behavior does not appear as partition parameters yet. The only way to trigger the behavior is by constructing a (lib internal-use only) Before the user can specify overlap, we have to:
My plan was to postpone those bits until after the next PR which adds the second half of overlap and "wire-up" the new overlap args and announce the behavior together. I'm happy to "pre-announce" it in CHANGELOG if that's how we do it, just let me know. I just figured we were focusing CHANGELOG on changes visible to the end-user rather than when we "laid the foundation" for those features, so to speak. |
There are two distinct overlap operations with completely different implementations. This is "intra-chunk" overlap, applying overlap to chunks resulting from text-splitting an oversized element.
f221528
to
ba1e8cd
Compare
Ok, in the interest of time I've added a CHANGELOG entry that describes the new behavior but notes that it is not yet available from the interface. |
There are two distinct overlap operations with completely different implementations. This is "intra-chunk" overlap, applying overlap to chunks resulting from text-splitting an oversized element.
So if an oversized element had text "abcd efgh ijkl mnop qrst" and was split at 15 chars with overlap of 5, it would produce "abcd efgh ijkl" and "ijkl mnop qrst". Any inter-chunk overlap from the prior chunk and applied at the beginning of the string (before "abcd") is handled in a separate operation in the next PR.