php_nub_qq - 1 month ago 9
PHP Question

Micro optimization on array keys

I have an array, and I use some of its items to construct more arrays. A rough example follows.

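$rows = [
    [1, 2, 3, 'a', 'b', 'c'],
    [4, 5, 6, 'd', 'e', 'f'],
    [4, 5, 6, 'g', 'h', 'i'],
];

// Initialize the counters up front so the increments below do not hit undefined indexes.
$derivedData = ['itemName' => ['count' => 0, 'items' => []]];

foreach ($rows as $data) {
    // Composite string key built from the first three columns.
    $key = $data[0] . '-' . $data[1] . '-' . $data[2];

    if (!isset($derivedData['itemName']['items'][$key])) {
        $derivedData['itemName']['items'][$key] = ['a' => null, 'count' => 0];
    }

    $derivedData['itemName']['count']++;
    $derivedData['itemName']['items'][$key]['a'] = $data[3];
    $derivedData['itemName']['items'][$key]['count']++;
}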


Now if I dump the array it's going to look something like

derivedData: [
    itemName: [
        count: 3
        items: [
            1-2-3: [
                a: a,
                count: 1
            ],
            4-5-6: [
                a: g,
                count: 2
            ],
        ]
    ]
]


As you can see, the keys in derivedData.itemName.items are strings. If I were to do something like this instead, would I gain any benefit?

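$uniqueId = 0;
$uniqueArray = [];

$rows = [
    [1, 2, 3, 'a', 'b', 'c'],
    [4, 5, 6, 'd', 'e', 'f'],
    [4, 5, 6, 'g', 'h', 'i'],
];

// Initialize the counters up front so the increments below do not hit undefined indexes.
$derivedData = ['itemName' => ['count' => 0, 'items' => []]];

foreach ($rows as $data) {
    $uniqueArrayKey = $data[0] . '-' . $data[1] . '-' . $data[2];

    // Assign the next integer ID the first time a composite key is seen.
    if (!isset($uniqueArray[$uniqueArrayKey])) {
        $uniqueArray[$uniqueArrayKey] = $uniqueId++;
    }

    $uniqueKey = $uniqueArray[$uniqueArrayKey];

    if (!isset($derivedData['itemName']['items'][$uniqueKey])) {
        $derivedData['itemName']['items'][$uniqueKey] = ['a' => null, 'count' => 0];
    }

    $derivedData['itemName']['count']++;
    $derivedData['itemName']['items'][$uniqueKey]['a'] = $data[3];
    $derivedData['itemName']['items'][$uniqueKey]['count']++;
}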


Now I will have an array of indexes and the actual data array.

uniqueArray: [
    1-2-3: 0,
    4-5-6: 1
]

derivedData: [
    itemName: [
        count: 3
        items: [
            0: [
                a: a,
                count: 1
            ],
            1: [
                a: g,
                count: 2
            ],
        ]
    ]
]


The question I am asking myself is: does PHP already do this internally when I use string keys, i.e. store each key string once and reference it by pointer instead of copying it every time?

In other words, let's say I have a variable $a. If I use it as a key in different arrays, will the value of $a be copied into each array as the key, or will a pointer to it in memory be used? That is basically my question.

Answer

In other words, let's say I have a variable $a. If I use it as a key in different arrays, will the value of $a be copied into each array as the key, or will a pointer to it in memory be used? That is basically my question.

This is where PHP 5.4-5.6 and PHP 7 differ, and the answer also depends on your environment. I'm not a PHP expert and my answer might be wrong, but I have been writing PHP extensions for quite a while, so I'll try to answer your question based on what I have observed.

In zend_hash.c, from the PHP 5.6.26 source, we can find this function:

ZEND_API int _zend_hash_add_or_update(HashTable *ht, const char *arKey, uint nKeyLength, void *pData, uint nDataSize, void **pDest, int flag ZEND_FILE_LINE_DC)
{
// omitted
        if (IS_INTERNED(arKey)) {
                p = (Bucket *) pemalloc(sizeof(Bucket), ht->persistent);
                p->arKey = arKey;
        } else {
                p = (Bucket *) pemalloc(sizeof(Bucket) + nKeyLength, ht->persistent);
                p->arKey = (const char*)(p + 1);
                memcpy((char*)p->arKey, arKey, nKeyLength);
        }
// omitted
}

It seems that whether the key string is copied depends on the value of IS_INTERNED(), so where is that defined? First of all, in ZendAccelerator.h, we can find:

#if ZEND_EXTENSION_API_NO > PHP_5_3_X_API_NO
// omitted
#else
# define IS_INTERNED(s)             0
// omitted
#endif

So the concept of an "interned string" was introduced in PHP 5.4. In PHP 5.3 and earlier, the key string is always copied. Since PHP <= 5.3 is really outdated, I'll leave it out of this answer. What about PHP 5.4-5.6? In zend_string.h:

#ifndef ZTS

#define IS_INTERNED(s) \
        (((s) >= CG(interned_strings_start)) && ((s) < CG(interned_strings_end)))

#else

#define IS_INTERNED(s) \
        (0)

#endif

Oh, hold on, another macro: where is CG() defined? In zend_globals_macros.h:

#ifdef ZTS
# define CG(v) TSRMG(compiler_globals_id, zend_compiler_globals *, v)
int zendparse(void *compiler_globals);
#else
# define CG(v) (compiler_globals.v)
extern ZEND_API struct _zend_compiler_globals compiler_globals;
int zendparse(void);
#endif

So in PHP 5.4-5.6 without Zend Thread Safety (ZTS), if the key string already lives in the interned-string area of this specific process, the bucket stores a pointer to it instead of a copy; with ZTS, the key is always copied. (FYI, ZTS is seldom needed on Linux.)

To clarify, the key string in this case ($key, or $uniqueArrayKey in the second snippet) will not be interned, because it is built at runtime. Interning only applies to strings known at compile time (literals). Thanks to @NikiC for the clarification.
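
A rough way to look at the literal versus runtime distinction from userland is debug_zval_dump(). This is only a sketch, assuming PHP 7 or later. The variable names are illustrative, and the exact dump text (refcount numbers, and whether an interned string is flagged as such) varies by PHP version and by whether opcache is enabled, so no expected output is shown.

```php
<?php
// Compile-time literal: the kind of string that is eligible for interning.
$literalKey = 'itemName';

// Built from data by concatenation, like $key in the question.
$data = [1, 2, 3];
$runtimeKey = $data[0] . '-' . $data[1] . '-' . $data[2];

// Capture both dumps so they can be compared side by side.
ob_start();
debug_zval_dump($literalKey);
$literalDump = ob_get_clean();

ob_start();
debug_zval_dump($runtimeKey);
$runtimeDump = ob_get_clean();

echo $literalDump, $runtimeDump;
```

Comparing the two lines on your own PHP build shows how the engine reports each kind of string.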

What about PHP 7? In zend_hash.c, from the PHP 7.0.11 source:

static zend_always_inline zval *_zend_hash_add_or_update_i(HashTable *ht, zend_string *key, zval *pData, uint32_t flag ZEND_FILE_LINE_DC)
{
        zend_ulong h;
        uint32_t nIndex;
        uint32_t idx;
        Bucket *p;

        IS_CONSISTENT(ht);
        HT_ASSERT(GC_REFCOUNT(ht) == 1);

        if (UNEXPECTED(!(ht->u.flags & HASH_FLAG_INITIALIZED))) {
                CHECK_INIT(ht, 0);
                goto add_to_hash;
        } else if (ht->u.flags & HASH_FLAG_PACKED) {
                zend_hash_packed_to_hash(ht);
        } else if ((flag & HASH_ADD_NEW) == 0) {
                p = zend_hash_find_bucket(ht, key);

                if (p) {
// omitted
                }
        }

        ZEND_HASH_IF_FULL_DO_RESIZE(ht);        /* If the Hash table is full, resize it */

add_to_hash:
        HANDLE_BLOCK_INTERRUPTIONS();
        idx = ht->nNumUsed++;
        ht->nNumOfElements++;
        if (ht->nInternalPointer == HT_INVALID_IDX) {
                ht->nInternalPointer = idx;
        }
        zend_hash_iterators_update(ht, HT_INVALID_IDX, idx);
        p = ht->arData + idx;
        p->key = key;
        if (!ZSTR_IS_INTERNED(key)) {
                zend_string_addref(key);
                ht->u.flags &= ~HASH_FLAG_STATIC_KEYS;
                zend_string_hash_val(key);
        }
// omitted
}

ZEND_API zval* ZEND_FASTCALL _zend_hash_str_add(HashTable *ht, const char *str, size_t len, zval *pData ZEND_FILE_LINE_DC)
{
        zend_string *key = zend_string_init(str, len, ht->u.flags & HASH_FLAG_PERSISTENT);
        zval *ret = _zend_hash_add_or_update_i(ht, key, pData, HASH_ADD ZEND_FILE_LINE_RELAY_CC);
        zend_string_release(key);
        return ret;
}

FYI,

#define ZSTR_IS_INTERNED(s)                 (GC_FLAGS(s) & IS_STR_INTERNED)

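Wow, so PHP 7 introduces a new zend_string structure that is reference-counted. When a non-interned string is used as a key, the hash table stores the pointer and calls zend_string_addref() instead of copying the bytes. This is far more efficient than the approach in PHP 5.6!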
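
A quick way to sanity-check this from userland is to watch memory while reusing one large runtime string as a key in many arrays. This is a minimal sketch, assuming PHP 7 or later; the key size and array count are arbitrary choices of mine, and memory_get_usage() only gives an approximate picture.

```php
<?php
// One large string created at runtime.
$key = str_repeat('k', 1024 * 1024); // about 1 MiB

$before = memory_get_usage();

// Use the same string as a key in 50 separate arrays.
$arrays = [];
for ($i = 0; $i < 50; $i++) {
    $arrays[] = [$key => $i];
}

$growth = memory_get_usage() - $before;

// If every array held its own copy of the key, growth would be around 50 MiB.
printf("memory growth: %d bytes for %d arrays\n", $growth, count($arrays));
```

If the keys are shared by reference count, the reported growth should stay far below 50 times the key size. Going by the 5.6 source above, a runtime-built key there would be expected to be copied into every array.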

In a nutshell, if you use an existing string as the key in a hash table, and of course you keep it unchanged: in PHP <= 5.3 it is copied; in PHP 5.4-5.6 without ZTS it is referenced if it is interned, otherwise copied; in PHP 5.4-5.6 with ZTS it is copied; in PHP 7 it is referenced.
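
To check whether the integer-ID indirection from the question buys anything on your setup, you can time both variants. The sketch below assumes PHP 7 or later (for return types). The helper names buildWithStringKeys and buildWithIntegerIds and the synthetic data are hypothetical, and the timings will differ between machines and PHP versions, so none are shown.

```php
<?php
// Hypothetical helper: the question's first approach, string keys only.
function buildWithStringKeys(array $rows): array
{
    $items = [];
    foreach ($rows as $data) {
        $key = $data[0] . '-' . $data[1] . '-' . $data[2];
        if (!isset($items[$key])) {
            $items[$key] = ['a' => null, 'count' => 0];
        }
        $items[$key]['a'] = $data[3];
        $items[$key]['count']++;
    }
    return $items;
}

// Hypothetical helper: the question's second approach, string key mapped to an integer ID.
function buildWithIntegerIds(array $rows): array
{
    $ids = [];   // string key => integer ID
    $items = []; // integer ID => item data
    foreach ($rows as $data) {
        $key = $data[0] . '-' . $data[1] . '-' . $data[2];
        if (!isset($ids[$key])) {
            $ids[$key] = count($ids);
        }
        $id = $ids[$key];
        if (!isset($items[$id])) {
            $items[$id] = ['a' => null, 'count' => 0];
        }
        $items[$id]['a'] = $data[3];
        $items[$id]['count']++;
    }
    return $items;
}

// Synthetic input: 100,000 rows spread over 1,000 distinct composite keys.
$rows = [];
for ($i = 0; $i < 100000; $i++) {
    $rows[] = [$i % 10, $i % 100, $i % 1000, 'v' . $i];
}

$start = microtime(true);
$byString = buildWithStringKeys($rows);
$stringSeconds = microtime(true) - $start;

$start = microtime(true);
$byId = buildWithIntegerIds($rows);
$idSeconds = microtime(true) - $start;

printf("string keys: %.4fs, integer ids: %.4fs\n", $stringSeconds, $idSeconds);
```

Note that the integer-ID variant still builds the composite string and looks it up in $ids for every row, so it does extra work on top of the string-key lookup it was meant to avoid.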

Additionally, I've found a great article for you to read (I'll read it later as well lol): http://jpauli.github.io/2015/09/18/php-string-management.html