Commit d7f4c24

potter-potter and potter-potter authored
fix documentation for chroma (Unstructured-IO#2403)
To test:
cd docs && make html

changelogs:
point main readme to the correct connector html page
point chroma docs to correct sample code

---------

Co-authored-by: potter-potter <[email protected]>
1 parent aaf3fd9 commit d7f4c24

10 files changed: +50 -22 lines changed

CHANGELOG.md

+2 -1
@@ -1,4 +1,4 @@
-## 0.12.1-dev8
+## 0.12.1-dev9

 ### Enhancements

@@ -20,6 +20,7 @@
 * **Pin version of unstructured-client** Set minimum version of unstructured-client to avoid raising a TypeError when passing `api_key_auth` to `UnstructuredClient`
 * **Fix the serialization of the Pinecone destination connector.** Presence of the PineconeIndex object breaks serialization due to TypeError: cannot pickle '_thread.lock' object. This removes that object before serialization.
 * **Fix the serialization of the Elasticsearch destination connector.** Presence of the _client object breaks serialization due to TypeError: cannot pickle '_thread.lock' object. This removes that object before serialization.
+* **Fix documentation and sample code for Chroma.** Was pointing to wrong examples..

 ## 0.12.0

README.md

+1 -1
@@ -193,7 +193,7 @@ In general, these functions fall into several categories:
 - *Embedding* encoder classes provide an interfaces for easily converting preprocessed text to
   vectors.

-The **Connectors** 🔗 in `unstructured` serve as vital links between the pre-processing pipeline and various data storage platforms. They allow for the batch processing of documents across various sources, including cloud services, repositories, and local directories. Each connector is tailored to a specific platform, such as Azure, Google Drive, or Github, and comes with unique commands and dependencies. To see the list of Connectors available in `unstructured` library, please check out the [Connectors GitHub folder](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest/connector) and [documentation](https://unstructured-io.github.io/unstructured/connectors.html)
+The **Connectors** 🔗 in `unstructured` serve as vital links between the pre-processing pipeline and various data storage platforms. They allow for the batch processing of documents across various sources, including cloud services, repositories, and local directories. Each connector is tailored to a specific platform, such as Azure, Google Drive, or Github, and comes with unique commands and dependencies. To see the list of Connectors available in `unstructured` library, please check out the [Connectors GitHub folder](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest/connector) and [documentation](https://unstructured-io.github.io/unstructured/ingest/index.html)

 ### PDF Document Parsing Example
 The following examples show how to get started with the `unstructured` library. You can parse over a dozen document types with one line of code! Use this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the example below.

docs/source/ingest/destination_connectors/chroma.rst

+2 -2
@@ -18,12 +18,12 @@ upstream local connector.

 .. tab:: Shell

-   .. literalinclude:: ./code/bash/pinecone.sh
+   .. literalinclude:: ./code/bash/chroma.sh
       :language: bash

 .. tab:: Python

-   .. literalinclude:: ./code/python/pinecone.py
+   .. literalinclude:: ./code/python/chroma.py
       :language: python


docs/source/ingest/destination_connectors/code/bash/chroma.sh

+2
@@ -14,4 +14,6 @@ unstructured-ingest \
   --host "localhost" \
   --port 8000 \
   --collection-name "collection name" \
+  --tenant "default_tenant" \
+  --database "default_database" \
   --batch-size 80

@@ -1,3 +1,9 @@
+from unstructured.ingest.connector.chroma import (
+    ChromaAccessConfig,
+    ChromaWriteConfig,
+    SimpleChromaConfig,
+)
+from unstructured.ingest.connector.local import SimpleLocalConfig
 from unstructured.ingest.interfaces import (
     ChunkingConfig,
     EmbeddingConfig,
@@ -6,28 +12,44 @@
     ReadConfig,
 )
 from unstructured.ingest.runner import LocalRunner
+from unstructured.ingest.runner.writers.base_writer import Writer
+from unstructured.ingest.runner.writers.chroma import (
+    ChromaWriter,
+)
+
+
+def get_writer() -> Writer:
+    return ChromaWriter(
+        connector_config=SimpleChromaConfig(
+            access_config=ChromaAccessConfig(),
+            host="localhost",
+            port=8000,
+            collection_name="elements",
+            tenant="default_tenant",
+            database="default_database",
+        ),
+        write_config=ChromaWriteConfig(),
+    )
+

 if __name__ == "__main__":
+    writer = get_writer()
     runner = LocalRunner(
         processor_config=ProcessorConfig(
             verbose=True,
-            output_dir="local-output-to-pinecone",
+            output_dir="local-output-to-chroma",
             num_processes=2,
         ),
+        connector_config=SimpleLocalConfig(
+            input_path="example-docs/book-war-and-peace-1225p.txt",
+        ),
         read_config=ReadConfig(),
         partition_config=PartitionConfig(),
         chunking_config=ChunkingConfig(chunk_elements=True),
         embedding_config=EmbeddingConfig(
             provider="langchain-huggingface",
         ),
-        writer_type="chroma",
-        writer_kwargs={
-            "host": "localhost",
-            "port": 8000,
-            "collection_name": "test-collection",
-            "batch_size": 80,
-        },
-    )
-    runner.run(
-        input_path="example-docs/fake-memo.pdf",
+        writer=writer,
+        writer_kwargs={},
     )
+    runner.run()

examples/ingest/chroma/ingest.sh

+2 -2
@@ -24,8 +24,8 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
   chroma \
   --path "<Location where Chroma is persisted, if not connecting via http>" \
   --settings "<Dictionary of settings to communicate with the chroma server>" \
-  --tenant "<Tenant to use for this client>" \
-  --database "<Database to use for this client>" \
+  --tenant "<Tenant to use for this client. Chroma defaults to 'default_tenant'>" \
+  --database "<Database to use for this client. Chroma defaults to 'default_database'>" \
   --host "<Hostname of the Chroma server>" \
   --port "<Port of the Chroma server>" \
   --ssl "<Whether to use SSL to connect to the Chroma server>" \

test_unstructured_ingest/dest/chroma.sh

+2
@@ -53,6 +53,8 @@ PYTHONPATH=. ./unstructured/ingest/main.py \
   --host "localhost" \
   --port 8000 \
   --collection-name "$COLLECTION_NAME" \
+  --tenant "default_tenant" \
+  --database "default_database" \
   --batch-size 80

 python "$SCRIPT_DIR"/python/test-ingest-chroma-output.py --collection-name "$COLLECTION_NAME"

unstructured/__version__.py

+1 -1
@@ -1 +1 @@
-__version__ = "0.12.1-dev8" # pragma: no cover
+__version__ = "0.12.1-dev9" # pragma: no cover

unstructured/ingest/cli/cmds/chroma.py

+3 -2
@@ -30,14 +30,15 @@ def get_cli_options() -> t.List[click.Option]:
             required=False,
             default="default_tenant",
             type=str,
-            help="The tenant to use for this client.",
+            help="The tenant to use for this client. Chroma defaults to 'default_tenant'.",
         ),
         click.Option(
             ["--database"],
             required=False,
             default="default_database",
             type=str,
-            help="The database to use for this client.",
+            help="The database to use for this client."
+            "Chroma defaults to 'default_database'.",
         ),
         click.Option(
             ["--host"],
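
One detail of the new `--database` help text may be worth a look: Python joins adjacent string literals without inserting a separator, so the two fragments in the hunk above run together unless one of them carries a space. A minimal sketch using the literals from the hunk (the variable names are illustrative only and do not appear in the commit):

```python
# Adjacent string literals are concatenated by Python with no separator.
help_as_written = (
    "The database to use for this client."
    "Chroma defaults to 'default_database'."
)

# A trailing space on the first fragment keeps the two sentences apart.
help_with_space = (
    "The database to use for this client. "
    "Chroma defaults to 'default_database'."
)

print(help_as_written)
print(help_with_space)
```

The `--tenant` help text in the same hunk is a single literal, so it is not affected.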

unstructured/ingest/connector/chroma.py

+2 -2
@@ -29,8 +29,8 @@ class SimpleChromaConfig(BaseConnectorConfig):
     access_config: ChromaAccessConfig
     collection_name: str
     path: t.Optional[str] = None
-    tenant: t.Optional[str] = None
-    database: t.Optional[str] = None
+    tenant: t.Optional[str] = "default_tenant"
+    database: t.Optional[str] = "default_database"
     host: t.Optional[str] = None
     port: t.Optional[int] = None
     ssl: bool = False
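
The effect of the new field defaults can be sketched with a stand-alone dataclass. `ChromaConfigSketch` below is a hypothetical stand-in that copies only the field declarations visible in this hunk. It is not the library's `SimpleChromaConfig`, which also takes an `access_config` and inherits from `BaseConnectorConfig`.

```python
import typing as t
from dataclasses import dataclass


@dataclass
class ChromaConfigSketch:
    """Hypothetical stand-in mirroring the fields shown in the hunk above."""

    collection_name: str
    path: t.Optional[str] = None
    tenant: t.Optional[str] = "default_tenant"
    database: t.Optional[str] = "default_database"
    host: t.Optional[str] = None
    port: t.Optional[int] = None
    ssl: bool = False


# Omitting tenant and database now yields the default names instead of None.
implicit = ChromaConfigSketch(collection_name="elements", host="localhost", port=8000)

# Explicit values still override the defaults.
explicit = ChromaConfigSketch(
    collection_name="elements", tenant="my_tenant", database="my_db"
)

print(implicit.tenant, implicit.database)
print(explicit.tenant, explicit.database)
```

With the previous `None` defaults, a caller that left these fields out would have passed `None` through to whatever consumes the config. The string defaults match the values the CLI options already used.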
