Ironman Engineering: PowerShell Universal Job Log Performance

PowerShell Universal

April 5, 2024


As PowerShell Universal has grown in popularity and expanded into larger, more complex systems, we have begun analyzing the performance of the platform's features at scale. One feature that has been a focus of this analysis is the job log, which lets you view the output of jobs that have been run. It is particularly useful for debugging and troubleshooting. However, as the number of jobs run on the platform increases, job log performance can suffer. In this blog post, we will discuss some of the performance issues we identified with the job log and the steps we have taken to address them.

History of Persistence in PowerShell Universal

PowerShell Universal started as a combination of Universal Dashboard and Universal Automation. Universal Automation was the job engine and Universal Dashboard was the front end and web server. Because they were distributed as PowerShell modules, we made the decision to use a single file database called LiteDB. We considered SQLite but would have had to ship every supported platform’s binary with the module. This complicated development and distribution. LiteDB was a single file database that was written in C# and could be easily embedded in a PowerShell module.

As the UD/UA platform evolved into PowerShell Universal and we were able to distribute the product via our own channels and build for specific platforms, we had the opportunity to re-evaluate our persistence strategy. We added SQL and, more recently, SQLite support. What we didn't do was re-evaluate the job log persistence strategy. We continued to use the same mechanism that we had used in Universal Automation to store job logs.

LiteDB v4 Job Log Performance

When we were originally working with LiteDB v4, we found that saving messages written by a job to the job log was slow. As the database grew, jobs took longer to run and even longer to query. We decided to use File Storage in LiteDB to improve performance: it let us quickly write to and read from the database and allowed streaming of logs while jobs were running. The implementation was rudimentary. To allow for structured output, like timestamps and stream types, we would serialize the entire log as a JSON array before storing it in the database.

For each message, whether from Write-Host, Write-Warning, or Write-Error, we would concatenate it onto the job log string and write the result to the database's file storage system. This meant we were writing entries as fast as PowerShell could produce them, with no buffering or batching.
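To make the pattern concrete, here is a rough C# sketch of per-message persistence through LiteDB's file storage. The method and storage path names are hypothetical; only FileStorage.Upload is the real LiteDB API. Note that every message re-serializes and re-uploads the whole log:

using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.Json;
using LiteDB;

// Hypothetical sketch: the entire log is re-serialized and re-written per message.
static void AppendLogMessage(LiteDatabase db, string jobId,
    List<string> messages, string newMessage)
{
    messages.Add(newMessage);

    // Serialize the whole log as a JSON array to preserve structure...
    var json = JsonSerializer.Serialize(messages);

    // ...then overwrite the file in LiteDB's file storage.
    using var stream = new MemoryStream(Encoding.UTF8.GetBytes(json));
    db.FileStorage.Upload($"$/joblogs/{jobId}", $"{jobId}.json", stream);
}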

SQL Job Log Performance

When we implemented SQL support, we again used this rudimentary approach. We had a single JobLog table within the SQL database. Every time a job log message was written, we would build a string and perform an UPDATE on the row. PowerShell Universal uses Entity Framework for database access, which saves us from writing raw SQL queries but also means that we are at the mercy of Entity Framework's performance.
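In Entity Framework terms, the approach looked roughly like the following sketch, where DatabaseContext and JobLogs are illustrative names rather than our actual schema:

using System;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Hypothetical sketch: load the row, append, and save for every single message.
static async Task WriteLogMessageAsync(DatabaseContext context, long jobId, string message)
{
    var jobLog = await context.JobLogs.SingleAsync(l => l.JobId == jobId);
    jobLog.Log += message + Environment.NewLine; // the Log value grows with every message
    await context.SaveChangesAsync();            // the UPDATE rewrites the entire value
}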

We started to notice performance issues with job logs when users ran thousands of jobs a day, when jobs wrote many log messages, or in resource-constrained environments like Azure SQL. Symptoms included poor performance of the entire PowerShell Universal server and very large transaction logs in the database.

As jobs ran, especially ones with large amounts of output, the job log continued to grow. Since we were still using the rudimentary approach, each log message triggered an UPDATE on the row, causing the entire, ever-larger row to be re-written in the database.

Tuning Attempt #1

In our first attempt to tune database performance, we took two steps that avoided changing our strategy per database platform. First, instead of using Entity Framework to update the database each time, we used a raw SQL query with the CONCAT function to append the new log message to the existing log. While this didn't avoid re-writing the entire row, it did avoid the overhead of sending the entire log over the network to the SQL server.

UPDATE JobLog
SET Log = CONCAT(Log, {jobLog.Log})
WHERE Id = {jobLog.Id}
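The {...} placeholders above are C# string interpolation; one way to issue such a statement from Entity Framework Core is ExecuteSqlInterpolatedAsync, which converts the interpolated values into SQL parameters (a sketch; context and jobLog are illustrative names):

// Only the new log fragment crosses the wire as a parameter;
// the existing log never leaves the SQL server.
await context.Database.ExecuteSqlInterpolatedAsync(
    $"UPDATE JobLog SET Log = CONCAT(Log, {jobLog.Log}) WHERE Id = {jobLog.Id}");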

Second, we had to add a new setting to preformat the log message before it was written to the database. This would allow us to avoid the overhead of serializing the log message to JSON each time. Unfortunately, this meant we lost some of the structured logging capabilities that we had previously.

While this reduced the network load and serialization overhead, it didn’t solve the problem of re-writing the entire row each time a log message was written. The database was still under too much pressure in busy environments.

Tuning Attempt #2

Our second attempt at tuning job log performance was to batch log messages. Instead of writing each log message to the database as it arrived, we batched messages and wrote them to the database in a single transaction, reducing both the number of database writes and their overhead.

This was accomplished by collecting log messages in memory and writing them to the database when a threshold was reached or when the job finished. We were still writing larger and larger logs to the database, but far less frequently. One downside was that if the server stopped, the log messages still held in memory were lost and jobs sometimes appeared blank, as the sketch below illustrates.
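A minimal sketch of the batching idea, assuming a hypothetical buffer class and an illustrative flush threshold:

using System;
using System.Collections.Generic;

// Hypothetical sketch: collect messages in memory and flush them in one write
// when a threshold is reached or the job finishes. Anything still buffered
// when the process stops is lost, which is the downside noted above.
class JobLogBuffer
{
    private readonly List<string> _pending = new();
    private const int FlushThreshold = 100; // illustrative value

    public void Write(string message)
    {
        _pending.Add(message);
        if (_pending.Count >= FlushThreshold)
            Flush();
    }

    public void Flush() // also called when the job completes
    {
        if (_pending.Count == 0)
            return;

        var batch = string.Join(Environment.NewLine, _pending);
        _pending.Clear();

        // One database round trip per batch instead of one per message, e.g.
        // UPDATE JobLog SET Log = CONCAT(Log, @batch) WHERE Id = @id
        PersistBatch(batch);
    }

    private void PersistBatch(string batch)
    {
        // Database write elided; see the CONCAT example above.
    }
}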

This again improved performance but didn't solve the problem of re-writing the entire row on each write. We still saw very high transaction log usage, and resource-constrained environments like Azure SQL remained very slow, with DTU usage at 100% on a 20 DTU database.

Tuning Attempt #3

After consideration and testing, we decided to implement a new approach to job log persistence. Instead of storing the entire log in a single row, we would store each log message as its own row, reusing the JobOutput table that already held the error, warning, and progress streams. This avoids re-writing the entire row for every message: SQL INSERT operations are much faster than these UPDATEs because they never have to lock and rewrite an existing, ever-growing row.
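The shape of the change, as a sketch with illustrative entity and property names:

// Hypothetical sketch: each message becomes its own JobOutput row,
// so no existing row is ever locked or rewritten.
context.JobOutput.Add(new JobOutputEntity
{
    JobId = jobId,
    Timestamp = DateTime.UtcNow,
    Type = "Information", // stream type: output, error, warning, progress, ...
    Message = message
});
await context.SaveChangesAsync(); // issues a plain INSERT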

We also had time on our side as new versions of LiteDB and Entity Framework were released. We could take advantage of new features in these libraries to improve performance.

While this change proved successful, we wanted to ensure that performance remained adequate, since tables with millions of records can be problematic to query. Luckily, we never have to run updates against the JobOutput table.

Results

Below are some results from the newly released v4.2.16 version with the new job log persistence strategy.

Test Case

We configured PowerShell Universal to use each of the persistence methods, including:

- LiteDB
- SQLite
- SQL
- Azure SQL

PowerShell Universal was configured to run 10 jobs concurrently. Each job would write 1000 log messages to the job log. Batches of jobs were started on a continuous schedule with a 1 second delay between each batch. Grooming was disabled for the test to ensure the job log would grow indefinitely.

Timing of both job execution and job log retrieval was measured at 500 batch intervals. LiteDB, SQLite and SQL were local to the PSU server. Azure SQL was remote with PSU running locally.

Job execution time was measured based on PSU duration reporting for the job. Retrieval times were measured using Microsoft Edge’s developer tools HTTP timing feature.

System Specs

LiteDB Results

While LiteDB v5 performed better than v4, it was still by far the slowest of the persistence methods. As the job log table grew, so did execution and retrieval times. Two of the jobs failed with the error "Error executing job: LiteDB ENSURE: pages in memory store must be non-shared." We also found that job grooming was slow after the test: it took over an hour to groom the 2.5 million messages.

Database Size               Execution    Retrieval
500 jobs, 500K messages     4s           783ms
1000 jobs, 1M messages      7s           1540ms
1500 jobs, 1.5M messages    10s          4490ms
2000 jobs, 2M messages      13s          3100ms
2500 jobs, 2.5M messages    15s          3670ms

SQLite Results

SQLite performed extremely well. It was as fast as the local SQL database and had no job failures. Job grooming was fast due to the ability to run a ranged query to delete old messages.
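A ranged groom can be expressed as a single delete statement. A sketch following the earlier naming, where JobOutput and the id variable are illustrative:

// Deleting old messages by range avoids touching any surviving rows.
await context.Database.ExecuteSqlInterpolatedAsync(
    $"DELETE FROM JobOutput WHERE JobId < {oldestJobIdToKeep}");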

Database Size               Execution    Retrieval
500 jobs, 500K messages     600ms        10ms
1000 jobs, 1M messages      900ms        17ms
1500 jobs, 1.5M messages    900ms        19ms
2000 jobs, 2M messages      900ms        19ms
2500 jobs, 2.5M messages    900ms        16ms

SQL Results

SQL performed similarly to SQLite. It was fast, had no job failures, and job grooming was also fast. For SQL, we extended the test out to 5000 jobs.

Database Size               Execution    Retrieval
500 jobs, 500K messages     500ms        17ms
1000 jobs, 1M messages      550ms        24ms
1500 jobs, 1.5M messages    700ms        20ms
2000 jobs, 2M messages      900ms        25ms
2500 jobs, 2.5M messages    1000ms       21ms
5000 jobs, 5M messages      1000ms       20ms

Azure SQL Results

We ran an Azure SQL Standard S1 database with 20 DTUs for this test. The PSU test machine ran locally, not in Azure, so there was noticeable latency in job execution and retrieval times. That said, Azure SQL was very consistent, and average DTU usage for the test was 30%.

Database Size               Execution    Retrieval
500 jobs, 500K messages     2s           200ms
1000 jobs, 1M messages      2s           200ms
1500 jobs, 1.5M messages    2s           200ms
2000 jobs, 2M messages      2s           400ms
2500 jobs, 2.5M messages    2s           510ms

Conclusion

The new job log persistence strategy has significantly improved job log performance in PowerShell Universal. With previous versions, far smaller tests would drive Azure SQL to 100% DTU usage. With the new strategy, we ran 2500 jobs with 2.5 million messages over about 45 minutes with few issues, and we never saw any degradation in the PSU service overall.

We will continue to monitor database performance and make further improvements as necessary. In v5, we have already revisited our database layer to better isolate and control transactions in Entity Framework. With the new job log storage format, we can also consider additional optimizations, such as paging and table partitioning, to improve job log performance even further. While we are dropping support for LiteDB in v5, we may also spend some time improving LiteDB performance in PSU v4 versions, since it is the default database there. With the introduction of PostgreSQL support in v5, we'll also be running the same batch of tests to verify similar performance and releasing that information around the v5 release time frame.