Commit bc791d5

potter-potter authored
feat: add opensearch source and destination connector (Unstructured-IO#2349)
Adds OpenSearch as a source and destination. Since OpenSearch is a fork of Elasticsearch, these connectors rely heavily on inheriting the Elasticsearch connectors whenever possible.

- Adds OpenSearch source connector to be able to ingest documents from OpenSearch.
- Adds OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch.
- Defines an example unstructured elements schema for users to be able to set up their unstructured OpenSearch indexes easily.

---------

Co-authored-by: potter-potter <[email protected]>
1 parent d7f4c24 commit bc791d5

51 files changed, +1965 -4 lines changed

CHANGELOG.md

+2 -1
@@ -1,4 +1,4 @@
-## 0.12.1-dev9
+## 0.12.1-dev10
 
 ### Enhancements
 
@@ -13,6 +13,7 @@
 
 ### Features
 * **MongoDB Source Connector.** New source connector added to all CLI ingest commands to support downloading/partitioning files from MongoDB.
+* **Add OpenSearch source and destination connectors.** OpenSearch, a fork of Elasticsearch, is a popular storage solution for various functionality such as search, or providing intermediary caches within data pipelines. Feature: Added OpenSearch source connector to support downloading/partitioning files. Added OpenSearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into OpenSearch.
 
 ### Fixes
 

MANIFEST.in

+1
@@ -12,6 +12,7 @@ include requirements/ingest-wikipedia.in
 include requirements/ingest-google-drive.in
 include requirements/ingest-gcs.in
 include requirements/ingest-elasticsearch.in
+include requirements/ingest-opensearch.in
 include requirements/ingest-dropbox.in
 include requirements/ingest-box.in
 include requirements/ingest-onedrive.in

Makefile

+4
@@ -179,6 +179,10 @@ install-ingest-wikipedia:
 install-ingest-elasticsearch:
 	python3 -m pip install -r requirements/ingest/elasticsearch.txt
 
+.PHONY: install-ingest-opensearch
+install-ingest-opensearch:
+	python3 -m pip install -r requirements/ingest/opensearch.txt
+
 .PHONY: install-ingest-confluence
 install-ingest-confluence:
 	python3 -m pip install -r requirements/ingest/confluence.txt

@@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+
+EMBEDDING_PROVIDER=${EMBEDDING_PROVIDER:-"langchain-huggingface"}
+
+unstructured-ingest \
+  local \
+  --input-path example-docs/book-war-and-peace-1225p.txt \
+  --output-dir local-output-to-opensearch \
+  --strategy fast \
+  --chunk-elements \
+  --embedding-provider "$EMBEDDING_PROVIDER" \
+  --num-processes 4 \
+  --verbose \
+  opensearch \
+  --hosts "$OPENSEARCH_HOSTS" \
+  --username "$OPENSEARCH_USERNAME" \
+  --password "$OPENSEARCH_PASSWORD" \
+  --index-name "$OPENSEARCH_INDEX_NAME" \
+  --num-processes 2
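
The script above takes its connection settings from four environment variables, and bash silently expands an unset variable to an empty string. A minimal pre-flight check could look like the sketch below. The helper name `missing_opensearch_env` is hypothetical and is not part of `unstructured`; only the variable names come from the script.

```python
import os
from typing import List, Mapping

# Environment variables referenced by the example script.
REQUIRED_VARS = (
    "OPENSEARCH_HOSTS",
    "OPENSEARCH_USERNAME",
    "OPENSEARCH_PASSWORD",
    "OPENSEARCH_INDEX_NAME",
)


def missing_opensearch_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; not an unstructured API.
    """
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_opensearch_env()` before launching the ingest run and stopping when the returned list is non-empty would surface a missing setting early, rather than as a connection error later.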

@@ -0,0 +1,62 @@
+import os
+
+from unstructured.ingest.connector.elasticsearch import (
+    ElasticsearchWriteConfig,
+)
+from unstructured.ingest.connector.local import SimpleLocalConfig
+from unstructured.ingest.connector.opensearch import (
+    OpenSearchAccessConfig,
+    SimpleOpenSearchConfig,
+)
+from unstructured.ingest.interfaces import (
+    ChunkingConfig,
+    EmbeddingConfig,
+    PartitionConfig,
+    ProcessorConfig,
+    ReadConfig,
+)
+from unstructured.ingest.runner import LocalRunner
+from unstructured.ingest.runner.writers.base_writer import Writer
+from unstructured.ingest.runner.writers.opensearch import (
+    OpenSearchWriter,
+)
+
+
+def get_writer() -> Writer:
+    return OpenSearchWriter(
+        connector_config=SimpleOpenSearchConfig(
+            access_config=OpenSearchAccessConfig(
+                hosts=os.getenv("OPENSEARCH_HOSTS"),
+                username=os.getenv("OPENSEARCH_USERNAME"),
+                password=os.getenv("OPENSEARCH_PASSWORD"),
+            ),
+            index_name=os.getenv("OPENSEARCH_INDEX_NAME"),
+        ),
+        write_config=ElasticsearchWriteConfig(
+            batch_size_bytes=15_000_000,
+            num_processes=2,
+        ),
+    )
+
+
+if __name__ == "__main__":
+    writer = get_writer()
+    runner = LocalRunner(
+        processor_config=ProcessorConfig(
+            verbose=True,
+            output_dir="local-output-to-opensearch",
+            num_processes=2,
+        ),
+        connector_config=SimpleLocalConfig(
+            input_path="example-docs/book-war-and-peace-1225p.txt",
+        ),
+        read_config=ReadConfig(),
+        partition_config=PartitionConfig(),
+        chunking_config=ChunkingConfig(chunk_elements=True),
+        embedding_config=EmbeddingConfig(
+            provider="langchain-huggingface",
+        ),
+        writer=writer,
+        writer_kwargs={},
+    )
+    runner.run()
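
The write config above sets `batch_size_bytes=15_000_000`. The writer's internals are not shown in this excerpt, so the following is only a sketch of what size-bounded batching generally looks like, under the assumption that documents are grouped until their serialized JSON size would exceed the limit. The function name `batch_by_bytes` is hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List


def batch_by_bytes(docs: Iterable[Dict], max_bytes: int) -> Iterator[List[Dict]]:
    """Yield lists of docs whose combined JSON size stays within max_bytes.

    A single document larger than max_bytes is yielded in a batch of its own.
    Illustrative only; not the connector's actual implementation.
    """
    batch: List[Dict] = []
    batch_bytes = 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        # Close the current batch before it would exceed the limit.
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch
```

With a limit of this kind, a smaller value means more, lighter bulk requests, and a larger value means fewer, heavier ones.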

docs/source/ingest/destination_connectors/data/elasticsearch_elements_mappings.json

+1 -1
@@ -8,7 +8,7 @@
       "analyzer": "english"
     },
     "type": {
-      "type": "keyword"
+      "type": "text"
     },
     "embeddings": {
       "type": "dense_vector",

@@ -0,0 +1,152 @@
+{"settings": {
+    "index": {
+      "knn": true,
+      "knn.algo_param.ef_search": 100
+    }
+  },
+  "mappings": {
+    "properties": {
+      "element_id": {
+        "type": "keyword"
+      },
+      "text": {
+        "type": "text",
+        "analyzer": "english"
+      },
+      "type": {
+        "type": "text"
+      },
+      "embeddings": {
+        "type": "knn_vector",
+        "dimension": 384
+      },
+      "metadata": {
+        "type": "object",
+        "properties": {
+          "category_depth": {
+            "type": "integer"
+          },
+          "parent_id": {
+            "type": "keyword"
+          },
+          "attached_to_filename": {
+            "type": "keyword"
+          },
+          "filetype": {
+            "type": "keyword"
+          },
+          "last_modified": {
+            "type": "date"
+          },
+          "file_directory": {
+            "type": "keyword"
+          },
+          "filename": {
+            "type": "keyword"
+          },
+          "data_source": {
+            "type": "object",
+            "properties": {
+              "url": {
+                "type": "text",
+                "analyzer": "standard"
+              },
+              "version": {
+                "type": "keyword"
+              },
+              "date_created": {
+                "type": "date"
+              },
+              "date_modified": {
+                "type": "date"
+              },
+              "date_processed": {
+                "type": "date"
+              },
+              "record_locator": {
+                "type": "keyword"
+              },
+              "permissions_data": {
+                "type": "object"
+              }
+            }
+          },
+          "coordinates": {
+            "type": "object",
+            "properties": {
+              "system": {
+                "type": "keyword"
+              },
+              "layout_width": {
+                "type": "float"
+              },
+              "layout_height": {
+                "type": "float"
+              },
+              "points": {
+                "type": "float"
+              }
+            }
+          },
+          "languages": {
+            "type": "keyword"
+          },
+          "page_number": {
+            "type": "integer"
+          },
+          "page_name": {
+            "type": "keyword"
+          },
+          "url": {
+            "type": "text",
+            "analyzer": "standard"
+          },
+          "links": {
+            "type": "object"
+          },
+          "link_urls": {
+            "type": "text"
+          },
+          "link_texts": {
+            "type": "text"
+          },
+          "sent_from": {
+            "type": "text",
+            "analyzer": "standard"
+          },
+          "sent_to": {
+            "type": "text",
+            "analyzer": "standard"
+          },
+          "subject": {
+            "type": "text",
+            "analyzer": "standard"
+          },
+          "section": {
+            "type": "text",
+            "analyzer": "standard"
+          },
+          "header_footer_type": {
+            "type": "keyword"
+          },
+          "emphasized_text_contents": {
+            "type": "text"
+          },
+          "emphasized_text_tags": {
+            "type": "keyword"
+          },
+          "text_as_html": {
+            "type": "text",
+            "analyzer": "standard"
+          },
+          "regex_metadata": {
+            "type": "object"
+          },
+          "detection_class_prob": {
+            "type": "float"
+          }
+        }
+      }
+    }
+  }
+}
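
The mapping above declares `embeddings` as a `knn_vector` with `dimension` 384, so the vectors written to the index need to have exactly that many components. The sketch below checks elements against the mapping before a write. It assumes elements are plain dicts with `element_id` and `embeddings` keys, as the mapping suggests, and the helper names `knn_dimension` and `mismatched_elements` are hypothetical.

```python
from typing import Dict, List, Optional


def knn_dimension(mapping: Dict, field: str = "embeddings") -> Optional[int]:
    """Return the declared dimension of a knn_vector field, or None if absent."""
    props = mapping.get("mappings", {}).get("properties", {})
    spec = props.get(field, {})
    if spec.get("type") != "knn_vector":
        return None
    return spec.get("dimension")


def mismatched_elements(elements: List[Dict], mapping: Dict) -> List[str]:
    """Return element_ids whose embedding length differs from the mapping.

    Elements without an "embeddings" key count as length 0 and are reported.
    If the mapping declares no knn_vector dimension, nothing is reported.
    """
    expected = knn_dimension(mapping)
    if expected is None:
        return []
    return [
        el.get("element_id", "<no id>")
        for el in elements
        if len(el.get("embeddings", [])) != expected
    ]
```

If an embedding model with a different output size is used, the `dimension` value in the mapping would need to change to match.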

docs/source/ingest/destination_connectors/elasticsearch.rst

+10
@@ -30,3 +30,13 @@ upstream local connector.
 For a full list of the options the CLI accepts check ``unstructured-ingest <upstream connector> elasticsearch --help``.
 
 NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you're running this locally. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_.
+
+Vector Search Sample Mapping
+----------------------------
+
+To make sure the schema of the index matches the data being written to it, a sample mapping json can be used.
+
+.. literalinclude:: ./data/elasticsearch_elements_mapping.json
+   :language: json
+   :linenos:
+   :caption: Object description

@@ -0,0 +1,42 @@
+OpenSearch
+======================
+
+Batch process all your records using ``unstructured-ingest`` to store structured outputs locally on your filesystem and upload those local files to an OpenSearch index.
+
+First you'll need to install OpenSearch dependencies as shown here.
+
+.. code:: shell
+
+  pip install "unstructured[opensearch]"
+
+Run Locally
+-----------
+The upstream connector can be any of the ones supported, but for convenience here, showing a sample command using the
+upstream local connector.
+
+.. tabs::
+
+   .. tab:: Shell
+
+      .. literalinclude:: ./code/bash/opensearch.sh
+         :language: bash
+
+   .. tab:: Python
+
+      .. literalinclude:: ./code/python/opensearch.py
+         :language: python
+
+
+For a full list of the options the CLI accepts check ``unstructured-ingest <upstream connector> opensearch --help``.
+
+NOTE: Keep in mind that you will need to have all the appropriate extras and dependencies for the file types of the documents contained in your data storage platform if you're running this locally. You can find more information about this in the `installation guide <https://unstructured-io.github.io/unstructured/installing.html>`_.
+
+Vector Search Sample Mapping
+----------------------------
+
+To make sure the schema of the index matches the data being written to it, a sample mapping json can be used.
+
+.. literalinclude:: ./data/opensearch_elements_mapping.json
+   :language: json
+   :linenos:
+   :caption: Object description

@@ -0,0 +1,10 @@
+#!/usr/bin/env bash
+
+unstructured-ingest \
+  opensearch \
+  --metadata-exclude filename,file_directory,metadata.data_source.date_processed \
+  --url http://localhost:9200 \
+  --index-name movies \
+  --fields 'ethnicity, director, plot' \
+  --output-dir opensearch-ingest-output \
+  --num-processes 2

@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+
+unstructured-ingest \
+  opensearch \
+  --metadata-exclude filename,file_directory,metadata.data_source.date_processed \
+  --url http://localhost:9200 \
+  --index-name movies \
+  --fields 'ethnicity, director, plot' \
+  --output-dir opensearch-ingest-output \
+  --num-processes 2 \
+  --partition-by-api \
+  --api-key "<UNSTRUCTURED-API-KEY>"
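
Both scripts pass `--fields` as one quoted, comma-separated string with spaces after the commas, while the programmatic version of this example supplies the same three names as a Python list. How the CLI parses the option is not shown in this excerpt; the sketch below only illustrates one plausible normalization, and `split_csv_option` is a hypothetical name.

```python
from typing import List


def split_csv_option(value: str) -> List[str]:
    """Split a comma-separated option value, trimming whitespace and dropping empties.

    Illustrative helper; not taken from the unstructured CLI.
    """
    return [item.strip() for item in value.split(",") if item.strip()]
```

Under this assumption, the quoted `--fields` value and the comma-separated `--metadata-exclude` value (written without spaces) would normalize in the same way.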

@@ -0,0 +1,25 @@
+from unstructured.ingest.connector.opensearch import (
+    OpenSearchAccessConfig,
+    SimpleOpenSearchConfig,
+)
+from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig
+from unstructured.ingest.runner import OpenSearchRunner
+
+if __name__ == "__main__":
+    runner = OpenSearchRunner(
+        processor_config=ProcessorConfig(
+            verbose=True,
+            output_dir="opensearch-ingest-output",
+            num_processes=2,
+        ),
+        read_config=ReadConfig(),
+        partition_config=PartitionConfig(
+            metadata_exclude=["filename", "file_directory", "metadata.data_source.date_processed"],
+        ),
+        connector_config=SimpleOpenSearchConfig(
+            access_config=OpenSearchAccessConfig(hosts=["http://localhost:9200"]),
+            index_name="movies",
+            fields=["ethnicity", "director", "plot"],
+        ),
+    )
+    runner.run()
