Functions to extract information from unstructured text.
Click here to learn how to use Extract Wrangles in Excel.
Extract geographical information from unstructured text such as streets, cities or countries.
SampleParameterswrangles:
- extract.address:
input: Location
output: Street
dataType: streets
where: SUBSTRING(Location, 1, 3) = '221'
Location |
Street |
221 B Baker St., London, England, United Kingdom |
['221 B Baker St.'] |
London SW1A 1AA, London, England, United Kingdom |
|
Parameter |
Required |
Data Type |
Notes |
input |
✓ |
str, list |
|
output |
✓ |
str, list |
|
dataType |
✓ |
streets / cities / regions / countries |
Specific part of the address to extract |
where |
|
str |
Filter the data to only apply the wrangle to certain rows using an equivalent to a SQL where criteria, such as column1 = 123 OR column2 = 'abc' |
where_params |
|
str |
Variables to use in conjunctions with where. This allows the query to be parameterized. This uses sqlite syntax (? or :name) |
if |
|
str |
A condition that will determine whether the action runs or not as a whole. |
Use the power of AI (OpenAI's chatGPT in particular) to extract meaningful data.
Note: This extract outputs a column for each item in the output list.
SampleParametersOutput Parameterswrangles:
- extract.ai:
api_key: Your OpenAI api key
input: Product Specs # optional
output:
Blade Diameter:
type: number
description: The diameter of the blade used, reported in inches.
default: N/A
examples:
- 4.5"
- 8 inch
Max. RPM:
type: number
description: The maximum rotations per minute (rpm).
default: 3600
examples:
- 3600 max. rpm
| | |
Product Specs |
18V Cordless 4.5in angle grinder |
120V 12in chop saw 3600 max. rpm |
|
→
|
Blade Diameter |
Max. RPM |
4.5 inches |
|
12 inches |
3600 |
|
Parameter |
Required |
Data Type |
Notes |
input |
|
list |
Name or list of input columns, defaults to entire dataframe. |
output |
✓ |
list |
List and description of the output you want. |
api_key |
✓ |
str |
Your OpenAI api key for the model. |
model |
|
str |
The name of the model (ie "gpt-3.5-turbo" which is the default) |
threads |
|
int |
The number of requests to send in parallel. |
timeout |
|
int |
The number of seconds to wait for a response before timing out |
retries |
|
int |
The number of times to retry if the request fails. This will apply exponential backoff to help with rate limiting. |
messages |
|
str, list |
Optional. Provide additional overall instructions for the AI. |
url |
|
str |
Override the default url for the AI endpoint. Must use the OpenAI chat completions API. |
where |
|
str |
Filter the data to only apply the wrangle to certain rows using an equivalent to a SQL where criteria, such as column1 = 123 OR column2 = 'abc' |
where_params |
|
str |
Variables to use in conjunctions with where. This allows the query to be parameterized. This uses sqlite syntax (? or :name) |
if |
|
str |
A condition that will determine whether the action runs or not as a whole. |
Parameter |
Required |
Data Type |
Notes |
type |
|
str |
JSON schema data type (string, number, integer, object, array, boolean or null). Note any outputs of type array without a child type defined should be strings. |
description |
|
str |
List and description of the output you want. |
default |
|
str, num, int, bool, "null", object, array |
A default value to return. |
examples |
|
array |
Provide examples of typical values to return. |
enum |
|
array |
List of possible values for the output. |
Extract numeric attributes from unstructured text such as lengths or voltages.
SampleParameterswrangles:
- extract.attributes:
input: tools
output: attributes
responseContent: span
Tools |
Attributes (Span) |
hammer 5kg, 0.5m |
{'length': ['0.5m'], 'mass': ['5kg']} |
Tools |
Attributes (Object) |
hammer 5kg, 0.5m |
{'length': [{'unit': 'metre', 'value': 0.5}], 'mass': [{'unit': 'kilogram', 'value': 5.0}]} |
wrangles:
- extract.attributes:
input: Tools
output: attributes
responseContent: span
attribute_type: mass # Specific attributes
Parameter |
Required |
Data Type |
Notes |
input |
✓ |
str, list |
|
output |
✓ |
str, list |
|
responseContent |
|
span / object |
If omited defaults to span (text found). Object returns an object with the value and unit. |
attribute_type |
|
See list of options below |
Specify which attributes to return, if omitted defaults to all |
desired_unit |
|
str |
Convert the extracted unit to the desired unit |
bound |
|
min/ mid/ max |
When returning an object, if the input is a range (e.g. 10-20mm) set the value to return. min, mid or max. Default mid. |
where |
|
str |
Filter the data to only apply the wrangle to certain rows using an equivalent to a SQL where criteria, such as column1 = 123 OR column2 = 'abc' |
where_params |
|
str |
Variables to use in conjunctions with where. This allows the query to be parameterized. This uses sqlite syntax (? or :name) |
if |
|
str |
A condition that will determine whether the action runs or not as a whole. |
List of Attributes:
- angle
- area
- capacitance
- charge
- current
- data transfer rate
- electric potential
- electrical conductance
- electrical resistance
- energy
- force
- frequency
- inductance
- instance frequency
- length
- luminous flux
- mass
- power
- pressure
- speed
- temperature
- time
- volume
- volumetric flow
Extract text properties in brackets from the input
SampleParameterswrangles:
- extract.brackets:
input: Data
output: Output
Data |
Output |
{Hello} |
Hello |
[Wrangles] |
Wrangles |
(!) |
! |
<!> |
! |
Parameter |
Required |
Data Type |
Notes |
input |
✓ |
str, list |
|
output |
✓ |
str, list |
|
where |
|
str |
Filter the data to only apply the wrangle to certain rows using an equivalent to a SQL where criteria, such as column1 = 123 OR column2 = 'abc' |
where_params |
|
str |
Variables to use in conjunctions with where. This allows the query to be parameterized. This uses sqlite syntax (? or :name) |
find |
|
str, list |
The type of brackets to find (round '()', square '[]', curly '{}', angled '<>'). Default is all brackets. |
include_brackets |
|
bool |
Include the brackets in the output |
if |
|
str |
A condition that will determine whether the action runs or not as a whole. |
Extract alphanumeric codes from unstructured text. Can be performed on one column or multiple columns.
SampleParameters¶ Extracting Codes From Single Column
# One column input
wrangles:
- extract.codes:
input: Secret
output: Code Extract
Secret |
Code Extract |
to gain access use Z1ON0101 |
['Z1ON0101'] |
¶ Extracting Codes From Multiple Columns
# Multi column input
wrangles:
- extract.codes:
input:
- code1
- code2
output: Codes
code1 |
code2 |
Codes |
code CH465517080-1 |
code CH465517080-2 |
['CH465517080-1', 'CH465517080-2'] |
Parameter |
Required |
Data Type |
Notes |
input |
✓ |
str, list |
|
output |
✓ |
str, list |
|
where |
|
str |
Filter the data to only apply the wrangle to certain rows using an equivalent to a SQL where criteria, such as column1 = 123 OR column2 = 'abc' |
where_params |
|
str |
Variables to use in conjunctions with where. This allows the query to be parameterized. This uses sqlite syntax (? or :name) |
if |
|
str |
A condition that will determine whether the action runs or not as a whole. |
Extract data from the input using a DIY or bespoke extraction wrangle. Can be performed on one column or multiple columns.
It is important to note that (non-regex) pattern matching extracts match whole words that are separated by word boundaries. Word boundaries include anything that is not a letter, number or an underscore.
SampleParameters¶ Extracting Wood Types From Single Column
# One column input
wrangles:
- extract.custom:
input: Product
output: Wood Types
model_id: model_id_here
Product |
Wood Types |
Dining Oakwood Chair |
Oakwood |
Living Room Teakwood Frame Mirror |
Teakwood |
¶ Extracting Wood Types From Multiple Columns
# Multi column input
wrangles:
- extract.custom:
input:
- Part 1 of 2
- Part 2 of 2
output: Wood Types
model_id: model_id_here
Part 1 of 2 |
Part 2 of 2 |
Wood Types |
Dining Acacia Wood Table |
Imitation Wood Table Chairs |
['Acacia Wood', 'Imitation Wood'] |
# Multiple Models
wrangles:
- extract.custom:
input:
- Product
- Product
output:
- Wood Types
- Item Type
model_id:
- wood_Type_model_id
- item_type_model_id
Product |
Wood Types |
Item Type |
Dining Oakwood Chair |
Oakwood |
Chair |
Living Room Teakwood Frame Mirror |
Teakwood |
Mirror |
Parameter |
Required |
Data Type |
Notes |
input |
✓ |
str, list |
|
output |
✓ |
str, list |
|
model_id |
✓ |
str, list |
Model id of the custom Extract Wrangle being used. |
use_labels |
|
bool |
Adds labels (if given) to the extracted data, defaults to false. See here for more information. |
first_element |
|
bool |
If true, returns only the first element of matched values. |
case_sensitive |
|
bool |
Allows the wrangle to be case sensitive if set to True, default is False. |
use_spellcheck |
|
bool |
Use spellcheck to correct spelling mistakes in the input. |
where |
|
str |
Filter the data to only apply the wrangle to certain rows using an equivalent to a SQL where criteria, such as column1 = 123 OR column2 = 'abc' |
where_params |
|
str |
Variables to use in conjunctions with where. This allows the query to be parameterized. This uses sqlite syntax (? or :name) |
if |
|
str |
A condition that will determine whether the action runs or not as a whole. |
Note: When using an AI based extract, the output will be returned as a dictionary.
Extract date properties from a date (day, month, year, etc...)
SampleParameterswrangles:
- extract.date_properties:
input: Date
output: Output
property: month_name
Date |
Output |
1992-08-13 00:00:00 |
August |
Parameter |
Required |
Data Type |
Notes |
input |
✓ |
str |
|
output |
|
str |
If not given, overwrites input |
property |
✓ |
day / day_of_year / month / month_name / weekday / week_day_name / week_year / quarter |
Property to extract from date |
where |
|
str |
Filter the data to only apply the wrangle to certain rows using an equivalent to a SQL where criteria, such as column1 = 123 OR column2 = 'abc' |
where_params |
|
str |
Variables to use in conjunctions with where. This allows the query to be parameterized. This uses sqlite syntax (? or :name) |
if |
|
str |
A condition that will determine whether the action runs or not as a whole. |
Extract date range from two dates
SampleParameterswrangles:
- extract.date_range:
start_time: Start
end_time: End
output: Output
range: months
Start |
End |
Output |
1992-08-13 00:00:00 |
2023-08-13 00:00:00 |
371 |
Parameter |
Required |
Data Type |
Notes |
start_time |
✓ |
str |
Name of the start date column |
end_time |
✓ |
str |
Name of the end date column |
output |
✓ |
str |
Name of the output column |
range |
|
See below for options |
Type of frequency to count, defaults to day |
where |
|
str |
Filter the data to only apply the wrangle to certain rows using an equivalent to a SQL where criteria, such as column1 = 123 OR column2 = 'abc' |
where_params |
|
str |
Variables to use in conjunctions with where. This allows the query to be parameterized. This uses sqlite syntax (? or :name) |
if |
|
str |
A condition that will determine whether the action runs or not as a whole. |
Range options:
- business days
- days
- weeks
- months
- semi months
- business month ends
- month starts
- semi month starts
- business month starts
- quarters
- quarter starts
- years
- business hours
- hours
- minutes
- seconds
- milliseconds
Extract Text and Links from HTML Elements
SampleParameterswrangles:
- extract.html:
input: HTML
output: Text
data_type: text
HTML |
Text |
<a href="https://www.wrangleworks.com/">Wrangle Works!</a> |
Wrangle Works! |
wrangles:
- extract.html:
input: HTML
output: Links
data_type: links
Parameter |
Required |
Data Type |
Notes |
input |
✓ |
str, list |
|
output |
|
str, list |
|
data_type |
✓ |
text / links |
The type of data to extract |
where |
|
str |
Filter the data to only apply the wrangle to certain rows using an equivalent to a SQL where criteria, such as column1 = 123 OR column2 = 'abc' |
where_params |
|
str |
Variables to use in conjunctions with where. This allows the query to be parameterized. This uses sqlite syntax (? or :name) |
if |
|
str |
A condition that will determine whether the action runs or not as a whole. |
Extract categorical properties from unstructured text such as colours or materials.
SampleParameterswrangles:
- extract.properties:
input: water bottles
output: properties
Products |
Product Properties |
Stainless Steel Blue Bottle |
{'Colours': ['Blue'], 'Materials': ['Stainless Steel']} |
Plastic Yellow Bottle |
{'Colours': ['Yellow'], 'Materials': ['Plastic']} |
wrangles:
- extract.properties:
input: water bottles
output: properties
property_type: colours # Optional
Products |
Product Colours |
Stainless Steel Blue Bottle |
['Blue'] |
Plastic Yellow Bottle |
['Yellow'] |
Parameter |
Required |
Data Type |
Notes |
input |
✓ |
str, list |
|
output |
✓ |
str, list |
|
property_type |
|
str |
The specific type of properties to extract (Colours, Materials, Shapes, Standards). Defaults to None, which will return all properties. |
return_data_type |
|
str |
The format to return the data (list or string), defaults to list |
where |
|
str |
Filter the data to only apply the wrangle to certain rows using an equivalent to a SQL where criteria, such as column1 = 123 OR column2 = 'abc' |
where_params |
|
str |
Variables to use in conjunctions with where. This allows the query to be parameterized. This uses sqlite syntax (? or :name) |
if |
|
str |
A condition that will determine whether the action runs or not as a whole. |
Extract single values using regex
SampleParameterswrangles:
- extract.regex:
input: Product
output: GPM
find: \d\.?\d? ?gpm
Product |
GPM |
3.4 gpm water pump |
3.4 gpm |
2gpm water pump |
2gpm |
Parameter |
Required |
Data Type |
Notes |
input |
✓ |
str |
|
output |
✓ |
str |
|
find |
✓ |
str |
Pattern to find using regex |
where |
|
str |
Filter the data to only apply the wrangle to certain rows using an equivalent to a SQL where criteria, such as column1 = 123 OR column2 = 'abc' |
where_params |
|
str |
Variables to use in conjunctions with where. This allows the query to be parameterized. This uses sqlite syntax (? or :name) |
if |
|
str |
A condition that will determine whether the action runs or not as a whole. |