regex pattern matching unicode characters for file renaming
I'm trying to understand which regular expressions are supported in the file renaming syntax, specifically how to match the full range of Unicode letters (\p{L}). The only way to match letters seems to be [a-zA-Z0-9_]. Standard regex expressions like \w or \S don't seem to work, and thus the one I need, \p{Letter} or \p{L}, doesn't either. Any help how I could make this work? I did include the u flag for Unicode support in regexOpts, but to no avail.
The context is this problem: trying to construct a regex pattern that extracts all initials of given names. The following pattern works for names in Latin script that don't have any diacritics:
{{ creators max="1" name="given" replaceFrom="(^| )([a-zA-Z0-9_])([a-zA-Z0-9_.])*" replaceTo="$2" regexOpts="gu" }}
But it fails in all other cases.
I thought I could use \p{L} to include all Unicode characters defined as letters, but it doesn't work:
{{ creators max="1" name="given" replaceFrom="(^| )(\p{L})(\p{L}|.)*" replaceTo="$2" regexOpts="gu" }}
Any help understanding the regex options (or suggestions for alternative ways to accomplish what I need) much appreciated.
EDIT:
FWIW, I've also tried with unicode character class ranges, but to no avail:
[\u0041-\u007A\u00C0-\u00FF\u0100-\u017F\u0400-\u04FF\u0370-\u03FF]
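The behaviour of \p{L} and of the ranges above can be checked outside Zotero. This is only a sketch: it assumes that replaceFrom and regexOpts end up in a JavaScript RegExp and that replaceTo is used as the replacement string, which the thread does not confirm. applyTemplateRegex is a hypothetical helper written for this illustration, not a Zotero function.

```typescript
// Hypothetical stand-in for the template's regex step (not Zotero code):
// assumes replaceFrom/regexOpts become a JavaScript RegExp and replaceTo
// is the replacement string.
function applyTemplateRegex(
  value: string,
  replaceFrom: string,
  replaceTo: string,
  regexOpts: string
): string {
  return value.replace(new RegExp(replaceFrom, regexOpts), replaceTo);
}

// Backslashes are doubled only because these are TypeScript string literals.
const letterWithU = new RegExp("\\p{L}", "u");
const letterWithoutU = new RegExp("\\p{L}");

console.log(letterWithU.test("Ł"));       // true
console.log(letterWithoutU.test("Ł"));    // false
console.log(letterWithoutU.test("p{L}")); // true: without "u" it matches the literal text

// The explicit ranges from the post, with the "u" flag.
const ranges = new RegExp(
  "[\\u0041-\\u007A\\u00C0-\\u00FF\\u0100-\\u017F\\u0400-\\u04FF\\u0370-\\u03FF]",
  "u"
);
console.log(ranges.test("_"));      // true: U+005F lies inside \u0041-\u007A
console.log(ranges.test("李"));      // false: outside every listed range
console.log(letterWithU.test("李")); // true

// The \p{L} pattern from the post, applied with regexOpts "gu".
const firstInitialPattern = "(^| )(\\p{L})(\\p{L}|.)*";
console.log(applyTemplateRegex("Łukasz", firstInitialPattern, "$2", "gu"));    // "Ł"
console.log(applyTemplateRegex("Jean Paul", firstInitialPattern, "$2", "gu")); // "J"
```

Two things show up under these assumptions. Without the u flag, \p{L} is not a property escape at all. And the unescaped . in (\p{L}|.)* matches any character, including the space, so that pattern keeps only the first initial.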
With u included in regexOpts, as you did (we will consider including it by default), this template should work. In fact, I copy-pasted the template you posted and got the following result: https://s3.amazonaws.com/zotero.org/images/forums/u3978561/9bk6d0l4efdmy864k6uz.png
If that's not what you're seeing, please explain what result you were expecting and what you are actually seeing. Also, please make sure you're running the latest stable (or beta) version of Zotero.
As a side note, if you're looking for author initials, you can use template parameters dedicated for this purpose, e.g.
{{ creators max="1" name="given" initialize="given" initialize-with="" }}
The bug is fixed now, so it works (for one author).
For posterity, the full code I ended up using is:
{{ creators max="1" name="family" }}, {{ creators max="1" name="given" replaceFrom="(?:^|\s|-)(\p{L})(?:(\p{L}*\s(da|de|di|(van|von)( der| de)?)\b)|([\p{L}'.]*))" replaceTo="$1" regexOpts="gui" }}
A simpler version, not filtering for lowercase particles that may have been stored as or be part of first names, is:
{{ creators max="1" name="family" }}, {{ creators max="1" name="given" replaceFrom="(?:^|\s|-)(\p{L})(\p{L}|[.'])*" replaceTo="$1" regexOpts="gu" }}
Note, this only works for the first author, which in my use case is sufficient.
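For anyone who wants to compare the two patterns before putting them in a template, here is a rough sketch that runs them over a few sample names. It assumes the template engine applies replaceFrom, replaceTo and regexOpts the way JavaScript's String.prototype.replace does. The helper initials and the sample names are made up for illustration.

```typescript
// Hypothetical helper (not Zotero code): assumes the template's regex step
// behaves like String.prototype.replace with a RegExp built from the options.
function initials(given: string, replaceFrom: string, regexOpts: string): string {
  return given.replace(new RegExp(replaceFrom, regexOpts), "$1");
}

// Backslashes are doubled only because these are TypeScript string literals.
// Pattern that also swallows a trailing lowercase particle (regexOpts "gui").
const withParticles =
  "(?:^|\\s|-)(\\p{L})(?:(\\p{L}*\\s(da|de|di|(van|von)( der| de)?)\\b)|([\\p{L}'.]*))";
// Simpler pattern (regexOpts "gu").
const simple = "(?:^|\\s|-)(\\p{L})(\\p{L}|[.'])*";

console.log(initials("Jean-Paul", simple, "gu"));            // "JP"
console.log(initials("Jean-Paul", withParticles, "gui"));    // "JP"
console.log(initials("Лев Николаевич", simple, "gu"));       // "ЛН"
console.log(initials("J. R. R.", simple, "gu"));             // "JRR"

// A particle stored in the given-name field is where the two differ.
console.log(initials("Ludwig van", simple, "gu"));           // "Lv"
console.log(initials("Ludwig van", withParticles, "gui"));   // "L"
```

One caveat from trying this: both patterns only consume characters that are \p{L}, a period, or an apostrophe, so a name typed with combining diacritics (a base letter followed by a separate combining mark) would leave stray characters behind.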
In the long run, it would still be great to have template parameters that return more than just the first initial of the author's given name and that act on every author included in the pattern, if someone so wishes.