Stack Overflow sourced, adapted and personally tested code
Extract email address from string using tsql
A continuation of working with strings in T-SQL, this time specifically with emails.
First, create the following function:
CREATE FUNCTION [dbo].[fnFindPatternLocation]
(
    @string NVARCHAR(MAX),
    @term   NVARCHAR(MAX)
)
RETURNS TABLE
AS
RETURN
(
    -- Number indexes into @term + @string, so subtracting LEN(@term) gives the position in @string
    SELECT pos = Number - LEN(@term)
    FROM (
        SELECT Number,
               Item = LTRIM(RTRIM(SUBSTRING(@string, Number, CHARINDEX(@term, @string + @term, Number) - Number)))
        -- sys.all_objects only supplies sequential row numbers; positions beyond its row count are never checked
        FROM (SELECT ROW_NUMBER() OVER (ORDER BY [object_id]) FROM sys.all_objects) AS n(Number)
        WHERE Number > 1
          AND Number <= CONVERT(INT, LEN(@string))
          AND SUBSTRING(@term + @string, Number, LEN(@term)) = @term
    ) AS y
);
Then create a view of the data you are interested in, as follows. Note that I am replacing the carriage returns with spaces here, as my subsequent query doesn’t like them and they frequently exist in emails.
CREATE VIEW [dbo].[v001] AS
SELECT pkid,
       REPLACE(body, CHAR(13) + CHAR(10), ' ') AS body1
FROM t001email;
Then run the newly created View through a query.
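-- v001 exposes only pkid and body1, so every string operation works on body1
SELECT pkid,
       body1,
       pos,
       SUBSTRING(body1, beginningOfEmail, endOfEmail - beginningOfEmail) AS email
FROM v001
-- position of each @ in the text
CROSS APPLY (SELECT pos FROM dbo.fnFindPatternLocation(body1, '@')) AS A(pos)
-- first space at or after the @ (a trailing space is appended so one is always found)
CROSS APPLY (SELECT CHARINDEX(' ', body1 + ' ', pos)) AS B(endOfEmail)
-- search backward from the @ for the nearest space; the email starts one character after it
CROSS APPLY (SELECT pos - CHARINDEX(' ', REVERSE(SUBSTRING(body1, 1, pos))) + 2) AS C(beginningOfEmail);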
A couple of things to note here.
Multiple emails will be picked out and returned as separate records, so if there is a string that reads
This is a sentence with two emails first@gmail.com and a second second@gmail.com
it will return
first@gmail.com
second@gmail.com
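To check this behaviour without touching real data, the same query can be pointed at a single literal string. This is only a sketch: it assumes dbo.fnFindPatternLocation from above has already been created, and the VALUES list (with the alias s) is a hypothetical stand-in for the v001 view.

```sql
-- Assumes dbo.fnFindPatternLocation already exists.
-- The VALUES row is a hypothetical stand-in for one row of v001.
SELECT s.pkid,
       SUBSTRING(s.body1, C.beginningOfEmail, B.endOfEmail - C.beginningOfEmail) AS email
FROM (VALUES (1, N'This is a sentence with two emails first@gmail.com and a second second@gmail.com')) AS s(pkid, body1)
-- one row per @ found in the string
CROSS APPLY (SELECT pos FROM dbo.fnFindPatternLocation(s.body1, N'@')) AS A(pos)
-- first space at or after the @
CROSS APPLY (SELECT CHARINDEX(N' ', s.body1 + N' ', A.pos)) AS B(endOfEmail)
-- one character after the nearest space before the @
CROSS APPLY (SELECT A.pos - CHARINDEX(N' ', REVERSE(SUBSTRING(s.body1, 1, A.pos))) + 2) AS C(beginningOfEmail);
```

Each @ in the string produces its own row, which is why the two addresses come back as separate records.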
If an email starts the field then this will NOT work. After finding the @ symbol the query searches backward for a space and fails to find one, so CHARINDEX returns 0 (not NULL). The start position then lands just past the @, and only the domain comes back, minus its first character. I will be looking to fix this at some point.
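One possible fix, offered as a minimal sketch rather than a tested solution: prepend a space to the text before reversing it, so the backward search always finds a space even when the email is the first thing in the field. As before, the VALUES row is a hypothetical stand-in for v001 and the function above is assumed to exist.

```sql
-- Assumes dbo.fnFindPatternLocation already exists.
-- The VALUES row is a hypothetical stand-in for one row of v001.
SELECT s.pkid,
       SUBSTRING(s.body1, C.beginningOfEmail, B.endOfEmail - C.beginningOfEmail) AS email
FROM (VALUES (1, N'first@gmail.com starts this field')) AS s(pkid, body1)
CROSS APPLY (SELECT pos FROM dbo.fnFindPatternLocation(s.body1, N'@')) AS A(pos)
CROSS APPLY (SELECT CHARINDEX(N' ', s.body1 + N' ', A.pos)) AS B(endOfEmail)
-- The prepended space ends up last after REVERSE, so it is only hit
-- when no real space precedes the @; beginningOfEmail then works out to 1.
CROSS APPLY (SELECT A.pos - CHARINDEX(N' ', REVERSE(N' ' + SUBSTRING(s.body1, 1, A.pos))) + 2) AS C(beginningOfEmail);
```

When a real space does precede the email, the backward search still hits that one first, so the result for emails in the middle of the text should be unchanged.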
Secondly, if the emails within the field are directly adjacent to special HTML characters such as < or >, these will be picked up and included as if they were part of the email addresses.
We can fix this by scanning through the varchar(max) field and stripping out the special characters before running the query.
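A minimal sketch of that clean-up, assuming angle brackets are the only problem characters: swap each one for a space inside the view, so they behave as delimiters in the extraction query. The view name v003 is hypothetical; pkid, body and t001email are the names used above.

```sql
-- Hypothetical view v003: same as v001, but angle brackets are also replaced
-- with spaces so they are no longer treated as part of an email address.
CREATE VIEW [dbo].[v003] AS
SELECT pkid,
       REPLACE(
           REPLACE(
               REPLACE(body, CHAR(13) + CHAR(10), ' '),  -- line breaks, as in v001
               '<', ' '),
           '>', ' ') AS body1
FROM t001email;
```

Text such as `John <john@example.com>` then becomes `John  john@example.com `, and the space-based search isolates the address. Other characters can be handled by nesting further REPLACE calls in the same way.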
NOTE: If you are working with email bodies, carriage returns will also break the above query, in which case consider running the field through a replace view with syntax similar to
CREATE VIEW v002 AS
SELECT pkid,
       REPLACE(body, CHAR(13) + CHAR(10), ' ') AS txtBodyWithoutReturns
FROM t001email;