5 min read

Postgres Arrays: When They Help and When They Hide a Modeling Problem

Arrays are a Postgres feature that can simplify code and a feature that can lock you into bad design. Here is the rule I use to decide when to reach for them.

Postgres arrays are nicer than they have any right to be. You can store a list of values in a single column, query containment, index it, even do array operations in WHERE clauses. The temptation to use them is constant.

The rule I have settled on: arrays are great for small bounded values. They are awkward for entities that need their own life. Knowing the difference saves you from migrations later.

When arrays are the right call

1. The values do not have their own identity. Tags on a post. Roles a user has. Categories a product belongs to. The values are labels, not records. They do not have created_at, updated_at, or audit history of their own.

CREATE TABLE posts (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  title TEXT NOT NULL,
  tags TEXT[] NOT NULL DEFAULT '{}'
);

INSERT INTO posts (title, tags) VALUES
  ('Postgres performance', ARRAY['postgres', 'performance', 'database']);

2. The set is small. Under a few dozen elements per row. Arrays with hundreds of elements are usable but indexing gets expensive.

3. The set is queried as a unit. "All posts with tag X" or "posts with any of these tags." If queries always operate on the array as a whole, arrays are convenient. If queries need to do per-element joins or aggregations, a separate table is better.

4. Updates are infrequent. Arrays in Postgres are immutable on disk — updating one element rewrites the whole array. For frequently-modified collections, this is expensive.

When arrays are wrong

The symptom that I look for: the array elements need their own metadata.

-- Wrong: array elements need created_at, position, is_primary
CREATE TABLE users (
  emails TEXT[]
);

-- Right: separate table
CREATE TABLE user_emails (
  user_id BIGINT REFERENCES users(id),
  email TEXT NOT NULL,
  is_primary BOOLEAN NOT NULL DEFAULT false,
  verified_at TIMESTAMPTZ,
  PRIMARY KEY (user_id, email)
);

The moment you want to track "when was this added," "is this the primary one," "who verified it" — you need a real table. Arrays do not give you those columns.

The migration from "array column" to "separate table" is unpleasant. Save yourself by recognizing the shift early.

Indexing arrays

For containment queries (WHERE tags @> ARRAY['postgres']), a GIN index:

CREATE INDEX posts_tags_gin
  ON posts USING GIN (tags);

Now WHERE tags @> ARRAY['postgres'] ("posts containing 'postgres'") uses the index. Same for WHERE tags && ARRAY['postgres', 'performance'] ("posts with any of these tags").

The GIN index is comparable in size to the array data itself. For small arrays (<10 elements), it is cheap. For arrays with hundreds of elements, the index gets large.

Common operations

Append:

UPDATE posts SET tags = tags || 'new-tag' WHERE id = 1;

Remove:

UPDATE posts SET tags = array_remove(tags, 'old-tag') WHERE id = 1;

Check membership:

SELECT * FROM posts WHERE 'postgres' = ANY(tags);
-- Or, equivalently:
SELECT * FROM posts WHERE tags @> ARRAY['postgres'];

The @> form uses GIN indexes; the = ANY form does not. For indexed lookups, prefer @>.

Count elements:

SELECT array_length(tags, 1) FROM posts WHERE id = 1;

Unnest for joins:

SELECT p.id, t.tag
FROM posts p, unnest(p.tags) AS t(tag)
WHERE t.tag IN (SELECT name FROM popular_tags);

Aggregating into arrays

The reverse — collecting rows into an array:

SELECT user_id, array_agg(role) AS roles
FROM user_roles
GROUP BY user_id;

array_agg is the canonical aggregator. For deduplication:

SELECT user_id, array_agg(DISTINCT role) AS roles
FROM user_roles
GROUP BY user_id;

This is useful for materialized denormalization — flattening a child table into an array column on the parent.

A pragmatic example

A tagging system at scale. Two designs:

Design A: arrays

CREATE TABLE posts (
  id BIGINT,
  title TEXT,
  tags TEXT[]
);
CREATE INDEX posts_tags_gin ON posts USING GIN (tags);

Queries are simple: WHERE tags @> ARRAY['postgres']. No joins.

Design B: junction table

CREATE TABLE posts (
  id BIGINT, title TEXT
);
CREATE TABLE post_tags (
  post_id BIGINT REFERENCES posts(id),
  tag TEXT NOT NULL,
  PRIMARY KEY (post_id, tag)
);
CREATE INDEX post_tags_tag_post ON post_tags (tag, post_id);

Queries are joins: SELECT * FROM posts p JOIN post_tags pt ON pt.post_id = p.id WHERE pt.tag = 'postgres'.

For read-heavy systems with simple queries, Design A is shorter and similarly performant. For systems where tags need their own metadata (categories, descriptions, parent-child relationships), Design B is essential.

The team I worked with that started with Design A migrated to Design B when they wanted to add tag display order, deprecation, and translations. The migration took two weeks. Picking right the first time would have saved that.

Things to avoid

  • Arrays of UUIDs as foreign keys to another table. The database does not enforce referential integrity for array elements. Use a junction table.
  • Multidimensional arrays. They work but are awkward to query. Usually a sign that the data wants a different shape.
  • Very long arrays. Arrays with thousands of elements indicate the data wants to be in its own table.

What I do today

For any new column that wants to hold a list of values, the question is: do these values have their own life?

  • Tags, roles, categories, simple labels → array.
  • Anything that might gain metadata, history, or relationships → separate table.

This decision is rarely revisited cleanly. Pick deliberately, with the future in mind.