Comparing different data access libraries fairly is complicated, as performance may vary for a wide range of reasons,
mainly related to the database itself:
the size of the tables, the indexes created, etc.
However, the ship-hold repository contains an example folder in
which you can find real use cases for our publishing platform, with implementations for various libraries. This is not a benchmark;
it is meant to give you a toy application to compare the different user experiences and to test how (in)efficient queries with associations can be.
Queries are therefore monitored and their execution time is measured.
The use cases, which are pretty common, are the following:

- A home page displaying the ten latest posts, including the three last published comments (with their authors' data), the tags related to each post, and the post authors' data.
- A user profile page displaying the five latest comments of a given user, including reference data for the post in which each comment took place, together with the five last posts published by that user, including each post's tags.
- A list of posts for a given tag, showing the five most recent posts, including the author's data and the three last comments.
Result
With a data set of 100,000 users, 20,000 posts, 120,000 comments, and around 200 tags, I get the following kind of results on my machine.
No index has been added, which is somewhat unrealistic; but as the nature of the queries generated by the different libraries is very different,
indexes would make them even more complicated to compare. Of course, you are invited to run your own tests with indexes.
Execution time of the listed use cases with different libraries:

| Case | ship-hold | sequelize | objection |
|---|---|---|---|
| latest posts | 55ms | 149ms | 111ms |
| user profile | 53ms | 102ms | 81ms |
| posts by tag | 164ms | ERROR | 338ms |
Why?
Why does ship-hold appear to perform so well compared to well-established libraries?
First of all, ship-hold remains a quite low-level abstraction and does not do much other than building a SQL query, handing it to the pg driver, and forwarding the response to
your application code. That is why, whether you run a query directly with the driver or through ship-hold, you will get quite similar results:
Get the five last articles, including their authors' data:

```js
// Directly with the pg driver
pg.query(`SELECT "posts".*, to_json("users".*) as "author"
FROM
(SELECT * FROM "posts" ORDER BY "published_at" DESC LIMIT 5) as "posts"
JOIN "users"
ON "posts"."user_id" = "users"."user_id"
`);

// With ship-hold's query builder
sh.select('posts.*', {
        value: toJson('"users".*'),
        as: 'author'
    })
    .from({
        value: sh.select()
            .from('posts')
            .orderBy('published_at', 'desc')
            .limit(5),
        as: 'posts'
    })
    .join('users')
    .on('posts.user_id', '"users"."user_id"')
    .run();

// With a ship-hold service
Posts
    .select()
    .orderBy('published_at', 'desc')
    .limit(5)
    .include(Users)
    .run();
```
On the other hand, ORMs usually do a few extra things, such as creating model instances from the rows, etc.
That is why you would use them, after all!
The second point is that ship-hold only works with one database engine (Postgres) and tries as much as possible
to rely on the database's features rather than implementing an equivalent on the JavaScript side. SQL is very powerful (and often underused)!
For example, in ship-hold, a query with associations results in a single query to the database,
and even the aggregation is done within that query.
Other libraries send several queries, buffer the results, and aggregate them in JavaScript.
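To picture what that JavaScript-side aggregation involves, here is a simplified, hypothetical sketch (not any particular library's actual code): two result sets, as if returned by two separate queries, are matched in application code.

```js
// Rows as if fetched by two separate queries.
const posts = [
    {post_id: 1, title: 'Hello', user_id: 42},
    {post_id: 2, title: 'World', user_id: 7}
];
const comments = [
    {comment_id: 10, post_id: 1, content: 'Nice!'},
    {comment_id: 11, post_id: 1, content: '+1'},
    {comment_id: 12, post_id: 2, content: 'Meh'}
];

// Index the comments by post_id, then attach them to their post.
const byPost = new Map();
for (const c of comments) {
    if (!byPost.has(c.post_id)) byPost.set(c.post_id, []);
    byPost.get(c.post_id).push(c);
}
const aggregated = posts.map(p => ({...p, comments: byPost.get(p.post_id) || []}));
```

ship-hold delegates this matching to Postgres (with `json_agg` and `to_json`), so none of this buffering code exists on the JS side.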
I see many advantages in proceeding that way:

- It greatly reduces the amount of code to write and makes ship-hold a lightweight framework compared to the others. Moreover, the code required for aggregation, for example, may be quite complicated and bug-prone: less code, fewer problems!
- Database makers strive to improve performance, and usually write faster code in C++, with better algorithms than I personally could on the JS side. Any performance improvement in the database engine therefore improves ship-hold as well.
- Join queries and sub-queries carry a share of redundant data. If you do the aggregation at the database level, I suspect less data needs to go through the wires, which might improve performance compared to a solution where the data is buffered and cleaned/aggregated on the JS side. This is especially true if your database is on a remote server.
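The redundancy point is easy to picture. A flat join repeats the author's columns on every matching row, whereas database-side aggregation sends each author once, as a nested object. A minimal sketch (with made-up data):

```js
// An author with a wide-ish row.
const author = {user_id: 42, name: 'Laurent', bio: 'a fairly long biography that takes some space...'};

// Flat join: the author's columns are duplicated on each of the 3 post rows.
const joined = [1, 2, 3].map(post_id => ({post_id, ...author}));

// Aggregated on the database side: the author travels over the wire once.
const aggregated = {author, post_ids: [1, 2, 3]};

// Rough proxy for the bytes sent over the wire.
const joinedSize = JSON.stringify(joined).length;
const aggregatedSize = JSON.stringify(aggregated).length;
```

The difference grows with the number of joined rows and the width of the repeated columns.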
Where would ship-hold not perform so well?
Even though I am trying to sell ship-hold to you, I must be honest. When I built ship-hold, I did not have performance particularly in mind.
I tend to favor user experience first, simplicity of the code, and extensible abstractions. The relatively good performance is more or less an accident.
As with any software, at some point you have to make trade-offs. This section explains when ship-hold might not perform well.
When you perform a query including associations, ship-hold uses Common Table Expressions behind the scenes.
CTEs are useful in many ways, but their main advantage is the readability they provide. They are eagerly evaluated (once only) and can be referenced in other parts of the query (a bit like variables).
This is very useful when you want to build a query recursively, which is the case when you fetch many associations at many levels of depth.
On the other hand, CTEs might not be simplified by Postgres' query planner the way their inlined equivalents would be (at least for now). The fact that CTEs are eagerly evaluated can also be a
disadvantage.
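A rough JavaScript analogy of the trade-off (this is an illustration, not what Postgres actually executes): the eager version materializes the whole intermediate result before the outer filter applies, while the inlined version lets the filter be pushed down first, so only the surviving rows pay the cost.

```js
// 1000 fake posts; we count how many rows each strategy actually processes.
const posts = Array.from({length: 1000}, (_, i) => ({post_id: i, user_id: i % 100}));
let eagerWork = 0;
let inlinedWork = 0;
const transform = counter => p => { counter(); return {...p, decorated: true}; };

// Eager (CTE-like): transform ALL rows first, then filter and limit.
const cte = posts.map(transform(() => eagerWork++));
const eager = cte.filter(p => p.user_id === 42).slice(0, 5);

// Inlined: filter and limit first, transform only the surviving rows.
const inlined = posts
    .filter(p => p.user_id === 42)
    .slice(0, 5)
    .map(transform(() => inlinedWork++));
// Same result, far less work in the inlined version.
```

Both produce the same five rows, but the eager version touches all 1000 while the inlined one touches only 5.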
If you run the following code:

```js
Posts
    .select()
    .orderBy('published_at', 'desc')
    .where('user_id', 42)
    .limit(5)
    .include(Comments, Users)
    .debug();
```
You will get the following SQL:

```sql
WITH
"Posts" AS (SELECT * FROM "posts" WHERE "user_id" = 42 ORDER BY "published_at" DESC LIMIT 5),
"comments" AS (SELECT * FROM "comments" WHERE "comments"."post_id" IN (SELECT "post_id" FROM "Posts")),
"author" AS (SELECT * FROM "users" WHERE "users"."user_id" IN (SELECT "user_id" FROM "Posts"))
SELECT
"Posts".*,
(SELECT (COALESCE(json_agg("comments".*),'[]'::json)) AS "comments"
 FROM (SELECT * FROM "comments" WHERE "comments"."post_id" = "Posts"."post_id") AS "comments"),
(SELECT (to_json("author".*)) AS "author"
 FROM "author" WHERE "author"."user_id" = "Posts"."user_id") AS "author"
FROM "Posts" ORDER BY "published_at" DESC
```
This query has a relatively low cost. The first CTE, "Posts",
is referenced a few times but evaluated only once.
Moreover, the scope of the query (posts whose author's id is 42) greatly narrows down the scope of the other, dependent CTEs and sub-queries.
To sum up, when you quickly narrow down the scope of a select with include (by applying pagination on the main builder, for example), ship-hold's query will perform quite well.
Now let's go back to the example where we wanted to create a dump file of all our posts.
We had an impressive result there. But let's now write our code with inclusions:
```js
Posts
    .select()
    .include(Users)
    .stream(function* () {
        let first = true;
        try {
            console.log('[');
            while (true) {
                const r = yield;
                if (first) {
                    console.log(JSON.stringify(r));
                } else {
                    console.log(',' + JSON.stringify(r));
                }
                first = false;
            }
        } catch (e) {
            throw e;
        } finally {
            console.log(']');
        }
    });
```
This time, it took 52 seconds to create the dump file (50x slower)!
Let's have a look at the SQL:
```sql
WITH
"Posts" AS (SELECT * FROM "posts"),
"author" AS (SELECT * FROM "users" WHERE "users"."user_id" IN (SELECT "user_id" FROM "Posts"))
SELECT
"Posts".*,
(SELECT (to_json("author".*)) AS "author" FROM "author" WHERE "author"."user_id" = "Posts"."user_id") AS "author"
FROM "Posts"
```
The first CTE is useless: it basically returns the whole posts table. Worse, the second one evaluates, for every user, whether an id belongs to a set of 20,000 items;
in fact it returns all the users who are also authors (about 8,000). Finally, our sub-query (the last select) has to match 20,000 posts against those 8,000 users.
No wonder it takes time!
To sum up, if your main query (the main builder) is likely to return a big subset, avoid queries with inclusions. Luckily for us, ship-hold remains
flexible, and you are not out of tools when you spot such a case and need to work around the issue.
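For instance, one hypothetical workaround for the dump scenario (a sketch, not ship-hold specific code) is to fetch posts and users with two plain queries and attach each author in application code with a lookup map: one O(1) lookup per post, instead of letting the database match every post against every author through the CTEs.

```js
// Rows as if fetched by two independent, inclusion-free queries.
const users = [
    {user_id: 1, name: 'Ada'},
    {user_id: 2, name: 'Linus'}
];
const posts = [
    {post_id: 10, title: 'First', user_id: 1},
    {post_id: 11, title: 'Second', user_id: 2},
    {post_id: 12, title: 'Third', user_id: 1}
];

// Build the lookup table once, then attach authors in a single linear pass.
const usersById = new Map(users.map(u => [u.user_id, u]));
const withAuthors = posts.map(p => ({...p, author: usersById.get(p.user_id)}));
```

Since the authors table fits easily in memory here (a few thousand rows), this trades one expensive combined query for two cheap ones plus a linear merge.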