regex pattern matching unicode characters for file renaming

edited 11 days ago
I'm trying to understand which regular expressions are supported in the file renaming syntax, specifically how to match the full range of Unicode letters, \p{L}.

The only way to match letters seems to be a character class like [a-zA-Z0-9_]. Standard shorthands such as \w or \S don't seem to work, and neither does the one I need, \p{Letter} or \p{L}. How could I make this work? I did include the u flag for Unicode support in regexOpts, but to no avail.

The context: I'm trying to construct a regex pattern that extracts the initials of all given names. The following pattern works for names in Latin script that don't have any diacritics:
{{ creators max="1" name="given" replaceFrom="(^| )([a-zA-Z0-9_])([a-zA-Z0-9_.])*" replaceTo="$2" regexOpts="gu" }}

But it fails in all other cases.
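To illustrate the failure outside Zotero, here is a minimal sketch in plain JavaScript. It assumes that replaceFrom and regexOpts behave like a JavaScript RegExp and that replaceTo behaves like the replacement string of String.prototype.replace. The helper name asciiInitials is made up for this example.

```javascript
// Assumption: the template's replaceFrom/regexOpts map onto a JavaScript RegExp,
// and replaceTo onto the replacement string of String.prototype.replace.
// `asciiInitials` is a hypothetical helper used only for this illustration.
function asciiInitials(givenName) {
  // ASCII-only pattern from the template above, with the character class closed.
  const pattern = new RegExp("(^| )([a-zA-Z0-9_])([a-zA-Z0-9_.])*", "gu");
  return givenName.replace(pattern, "$2");
}

console.log(asciiInitials("Anna Maria"));   // "AM"
console.log(asciiInitials("Élodie Marie")); // "ÉlodieM": "Élodie" is never matched
console.log(asciiInitials("Юрий"));         // "Юрий": left untouched
```

A name part is only reduced to an initial if it starts with an ASCII character, so anything beginning with an accented or non-Latin letter is left in place.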

I thought I could use \p{L} to include all unicode characters defined as letters, but it doesn't work.
{{ creators max="1" name="given" replaceFrom="(^| )(\p{L})(\p{L}|.)*" replaceTo="$2" regexOpts="gu" }}

Any help understanding the regex options (or suggestions for alternative ways to accomplish what I need) much appreciated.

EDIT:
FWIW, I've also tried with unicode character class ranges, but to no avail:
[\u0041-\u007A\u00C0-\u00FF\u0100-\u017F\u0400-\u04FF\u0370-\u03FF]



  • edited 6 days ago
    I thought I could use \p{L} to include all unicode characters defined as letters, but it doesn't work.
    With u included in regexOpts, as you did (we will consider enabling it by default), this template should work. In fact, I copy-pasted the template you posted and got the following result:

    https://s3.amazonaws.com/zotero.org/images/forums/u3978561/9bk6d0l4efdmy864k6uz.png
    If that's not what you're seeing, please explain what result you were expecting and what you are actually seeing. Also, please make sure you're running the latest stable (or beta) version of Zotero.
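    For anyone who wants to check the role of the u flag outside Zotero, here is a rough sketch in plain JavaScript. It assumes that replaceFrom and regexOpts are handed to a JavaScript RegExp. The sample name and variable names are made up for illustration.

```javascript
// Assumption: replaceFrom and regexOpts end up in a JavaScript RegExp.
const given = "Élodie Marie";

// Without "u", \p is just an escaped "p", so \p{L} matches the literal text "p{L}".
const withoutU = given.replace(new RegExp("\\p{L}", "g"), "#");

// With "u", \p{L} is a Unicode property escape and matches every letter.
const withU = given.replace(new RegExp("\\p{L}", "gu"), "#");

// The template as posted: the unescaped "." also matches spaces, so the first
// match runs to the end of the name and only one initial is left.
const posted = given.replace(new RegExp("(^| )(\\p{L})(\\p{L}|.)*", "gu"), "$2");

console.log(withoutU); // "Élodie Marie"
console.log(withU);    // "###### #####"
console.log(posted);   // "É"
```

    Under these assumptions the posted pattern does match non-ASCII letters once u is set, although it only returns the first initial.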

    As a side note, if you're looking for author initials, you can use template parameters dedicated for this purpose, e.g.

    {{ creators max="1" name="given" initialize="given" initialize-with="" }}
  • Thank you, tnajdek, for testing this, and my apologies: I should have posted back here as well. The problem was related to the Attanger plugin for file organization, which uses the standard Zotero renaming syntax but had a bug that broke any regex containing a backslash.

    The bug is fixed now, so it works (for one author).

    For posterity, the full code I ended up using is:
    {{ creators max="1" name="family" }}, {{ creators max="1" name="given" replaceFrom="(?:^|\s|-)(\p{L})(?:(\p{L}*\s(da|de|di|(van|von)( der| de)?)\b)|([\p{L}'.]*))" replaceTo="$1" regexOpts="gui" }}
    A simpler version, not filtering for lowercase particles that may have been stored as or be part of first names, is:
    {{ creators max="1" name="family" }}, {{ creators max="1" name="given" replaceFrom="(?:^|\s|-)(\p{L})(\p{L}|[.'])*" replaceTo="$1" regexOpts="gu" }}

    Note, this only works for the first author, which in my use case is sufficient.
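    The two patterns can be compared outside Zotero with a short JavaScript sketch. It assumes that the template options map directly onto a JavaScript RegExp and String.prototype.replace. The function names and sample names are made up for illustration.

```javascript
// Assumption: replaceFrom/regexOpts map onto a JavaScript RegExp and replaceTo
// onto the replacement string. Both function names are hypothetical.

// Simpler version: every name part (after start, whitespace, or hyphen)
// is reduced to its first letter.
function simpleInitials(givenName) {
  const pattern = new RegExp("(?:^|\\s|-)(\\p{L})(\\p{L}|[.'])*", "gu");
  return givenName.replace(pattern, "$1");
}

// Full version: a trailing particle such as "van" or "de" is swallowed
// together with the preceding name part instead of getting its own initial.
function initialsWithoutParticles(givenName) {
  const pattern = new RegExp(
    "(?:^|\\s|-)(\\p{L})(?:(\\p{L}*\\s(da|de|di|(van|von)( der| de)?)\\b)|([\\p{L}'.]*))",
    "gui"
  );
  return givenName.replace(pattern, "$1");
}

console.log(simpleInitials("Élodie Marie"));          // "ÉM"
console.log(simpleInitials("Jean-Paul"));             // "JP"
console.log(simpleInitials("Юрий Алексеевич"));       // "ЮА"
console.log(simpleInitials("J. R. R."));              // "JRR"
console.log(simpleInitials("Ludwig van"));            // "Lv"
console.log(initialsWithoutParticles("Ludwig van"));  // "L"
```

    The last two lines show the only difference between the versions: the simpler pattern turns a stored particle into an extra lowercase initial, while the full pattern drops it.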

    In the long run, it would still be great to have template parameters that produce more than just the first initial of a given name and that apply to every author included in the pattern, if someone so wishes.



