
Published in: Scientific journal «Студенческий» No. 7(345)

Journal section: Information Technology


Bibliographic citation:
Sadykov A.N. STATIC APPLICATION SECURITY TESTING (SAST) PROBLEMS: LIMITATIONS, EVALUATION GAPS, AND PRACTICAL MITIGATIONS // Студенческий: electronic scientific journal. 2026. No. 7(345). URL: https://sibac.info/journal/student/345/404363 (accessed: 21.03.2026).

STATIC APPLICATION SECURITY TESTING (SAST) PROBLEMS: LIMITATIONS, EVALUATION GAPS, AND PRACTICAL MITIGATIONS

Sadykov Abay Nagmetuly

Student, Department of Information Security, L.N. Gumilyov Eurasian National University,

Kazakhstan, Astana

Durmagambetov Aset Askhatbekovich


Scientific supervisor, PhD, senior lecturer, L.N. Gumilyov Eurasian National University,

Kazakhstan, Astana

ABSTRACT

Static Application Security Testing (SAST) is widely used for shift-left security, yet many teams still face low signal-to-noise, missed vulnerabilities in modern frameworks, and inconsistent results between tools and configurations. This article synthesizes recent empirical and practitioner evidence to explain why SAST underperforms on production code and why common benchmark-style evaluations can be misleading. We summarize recurring gaps—configuration sensitivity, limited semantic context, and workflow misalignment—and connect them to practical mitigations such as rule governance, risk-based baselines, developer-centered triage, and complementary dynamic and dependency scanning. The goal is to treat SAST as an engineered process rather than a one-click detector.

 

Keywords: static application security testing, static analysis, false positives, false negatives, benchmarking, evaluation, CI/CD, secure SDLC.

 

Introduction

Static Application Security Testing is widely promoted as a foundational ‘shift-left’ control: it enables security teams to detect weaknesses before deployment and enables developers to correct defects when the cost of change is relatively low. Large enterprises adopt SAST because it is automatable, repeatable, and can be applied continuously across very large codebases. A large-scale deployment experience report emphasizes exactly these benefits—automation, repeatability, and scalability—while noting that SAST cannot be fully effective without a supporting organizational strategy and continuous process improvement [4].

Despite widespread adoption, SAST consistently under-delivers on two expectations that many organizations implicitly attach to the ‘shift-left’ narrative. The first expectation is that scanning will accurately identify exploitable vulnerabilities with low operational noise; the second is that higher scan coverage automatically translates into higher application security. Empirical research shows that neither expectation holds in practice. A broad evaluation of Java SAST tools demonstrates that each tool detects only a minority of vulnerabilities on realistic benchmarks and that even the union of multiple tools leaves most vulnerabilities undetected [9]. Similarly, comparative evidence from practitioner-focused studies shows that dynamic and interactive techniques often find different vulnerability subsets, implying that any single SAST tool has structural blind spots [11].

Methodology

The article follows a narrative evidence-synthesis design focused on recurring limitations of SAST in practice. We selected peer-reviewed empirical studies and recent benchmark-oriented papers from ACM/IEEE venues, complemented by a systematic literature review, and extracted quantitative indicators (e.g., detection rates, adoption/configuration rates, benchmark sizes) and qualitative themes (e.g., sources of false negatives, barriers to benchmark usage). For each source, we mapped the reported findings to four problem dimensions—intrinsic analysis limits, engineering constraints, evaluation/benchmark gaps, and human/organizational factors—and cross-checked the consistency of conclusions across datasets and contexts. The goal is not to rank specific products, but to characterize systematic failure modes and derive actionable recommendations for tool configuration, CI/CD integration, and benchmark design.

To keep the synthesis actionable, sources were screened for two properties: (1) they reported measurements on realistic codebases (industrial repositories, widely used frameworks, or curated corpora with verifiable ground truth), and (2) they documented tool setup choices (rulesets, language versions, build context, suppressions, or tuning) that could plausibly change outcomes. The analysis therefore emphasizes comparative patterns—how performance shifts when the same tool is reconfigured, when the target code differs in size and dependency structure, and when developers interact with findings through pull requests and issue trackers—rather than absolute numbers from a single benchmark.

Results

The reviewed evidence converges on three stable results. First, recall on realistic vulnerability corpora is often low and varies widely by tool; in a curated production-code dataset, individual tools detected only 11.2–26.5% of known vulnerabilities, while combining four tools reached 38.8%, and a rule‑engineered configuration of Semgrep achieved 44.7% [3]. Second, developer-side usage patterns amplify these technical limits: only 20% of surveyed developers reported using SAST at all, and among SAST users 40% relied on a single tool with default configuration, despite empirical evidence that combining or configuring tools increases coverage [2]. Third, benchmark and evaluation artifacts frequently fail to support reliable tool selection and improvement; practitioners report obstacles such as coarse-grained metrics and poor diagnostic value, and propose clearer assumptions and more actionable reporting [10]. Figures 1–2 and Table 1 summarize the most load‑bearing quantitative indicators and structural benchmark properties.

 

Figure 1. Vulnerability detection rates reported on production-code datasets and after configuration

 

Additional findings from the reviewed studies indicate that configuration quality and rule governance often determine whether SAST is perceived as useful or as noise. Reports that many developers run tools “straight out of the box” are consistently associated with higher false-positive rates, faster alert fatigue, and reduced long‑term adoption, while programs with explicit ownership for tuning and suppressions report more stable usage [2]. In practice, this pattern is reinforced when rule changes are centralized as policy-as-code, enabling consistent baselines across teams while local teams focus on remediation.
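One way such a centralized policy-as-code baseline can be expressed is sketched below. This is a hypothetical illustration, not any specific tool's configuration format: the rule identifiers, severity levels, and the merge logic are all assumptions.

```python
# Hypothetical sketch of policy-as-code rule governance: a central,
# version-controlled baseline merged with team-level suppressions.
# Rule IDs and severity levels are illustrative, not from a real tool.

CENTRAL_BASELINE = {
    "sql-injection": "error",      # high-risk rules block the build
    "hardcoded-secret": "error",
    "weak-hash": "warning",        # lower-risk rules only warn
    "verbose-logging": "warning",
}

def effective_policy(team_suppressions):
    """Central baseline minus the team's formally reviewed suppressions."""
    return {
        rule: level
        for rule, level in CENTRAL_BASELINE.items()
        if rule not in team_suppressions
    }

# A team may suppress only rules it has reviewed and documented;
# the central baseline itself changes only through policy review.
policy = effective_policy({"verbose-logging"})
print(sorted(policy))  # ['hardcoded-secret', 'sql-injection', 'weak-hash']
```

The point of the sketch is the division of ownership: the baseline is owned centrally and versioned, while suppressions remain a local, auditable delta.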

 

Figure 2. Developer adoption and configuration practices for SAST tools

 

Another convergent result is that successful deployments translate SAST output into operationally measurable workflow outcomes rather than an open-ended backlog. Case evidence suggests that a small set of stable indicators—such as time-to-triage for new findings, the percentage of findings ending with accepted fixes versus justified suppressions, and the ratio of recurring findings to truly new ones—helps distinguish tool limitations from process bottlenecks. Tracking these indicators by repository and rule category also reduces “benchmark chasing” where headline precision improves but developers still experience noisy pull requests.
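The indicators above could be computed from an exported findings log. The following is a minimal sketch under the assumption that each finding record carries an open timestamp, a triage timestamp, an outcome, and a recurrence flag; the field names are illustrative.

```python
from datetime import datetime, timedelta

# Minimal sketch of the workflow indicators discussed above.
# The findings records and their field names are assumptions.
findings = [
    {"opened": datetime(2025, 1, 1), "triaged": datetime(2025, 1, 2),
     "outcome": "fixed", "recurring": False},
    {"opened": datetime(2025, 1, 1), "triaged": datetime(2025, 1, 5),
     "outcome": "suppressed", "recurring": True},
    {"opened": datetime(2025, 1, 3), "triaged": datetime(2025, 1, 4),
     "outcome": "fixed", "recurring": True},
]

def mean_time_to_triage(items):
    deltas = [f["triaged"] - f["opened"] for f in items]
    return sum(deltas, timedelta()) / len(deltas)

def fix_vs_suppress_ratio(items):
    fixed = sum(f["outcome"] == "fixed" for f in items)
    suppressed = sum(f["outcome"] == "suppressed" for f in items)
    return fixed / max(suppressed, 1)

def recurring_share(items):
    return sum(f["recurring"] for f in items) / len(items)

print(mean_time_to_triage(findings))    # 2 days, 0:00:00
print(fix_vs_suppress_ratio(findings))  # 2.0 (two fixes per suppression)
print(recurring_share(findings))        # ~0.67 (share of repeat findings)
```

Tracked per repository and per rule category, these three numbers separate "the tool is noisy" from "the process is stalled", which is exactly the distinction the studies above call for.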

Table 1

Taxonomy of dominant SAST problems and typical mitigations

| Problem class | Typical manifestations | Mitigation direction |
| --- | --- | --- |
| Intrinsic analysis limits | Approximate semantics, limited context/framework modeling, precision–scalability trade-offs | Framework-aware modeling, targeted rule engineering, hybrid approaches |
| Engineering constraints | Dependency/build failures, multi-language repos, long scan times, CI/CD friction, incomplete scan coverage | Incremental scanning, caching, scan-health metrics, pipeline staging, standardization |
| Evaluation and benchmarking gaps | Synthetic benchmarks overestimate performance, ambiguous ground truth, benchmark gaming, non-actionable metrics | Layered benchmark portfolio, diagnostic labeling, per-weakness metrics, transparent methodology |
| Human and organizational factors | Default configurations, alert fatigue, low security awareness, incentive mismatch, bypassing controls | Governance and ownership, curated rule packs, training, developer-centered reporting and remediation guidance |

 

Empirical accounts also support treating detection rules as governed assets with explicit ownership and lifecycle. Instead of enabling thousands of checks by default, organizations that define a risk-based baseline aligned to the technology stack and threat model report lower alert fatigue and higher predictability of scan behavior. Versioning the baseline, reviewing changes, and retiring or replacing rules that produce systematic false positives or become obsolete after framework upgrades are repeatedly described as practical mechanisms that improve signal quality and auditability.

Workflow integration emerges as a further stable result: how findings reach developers often influences actionability more than raw detection capability. Studies also emphasize that severity normalization is necessary for effective prioritization: a finding’s priority is most useful when it reflects business impact and exploitability in the actual runtime context, rather than only the syntactic pattern that triggered the rule.
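Such severity normalization can be sketched as a simple scoring function. The weights, factor names, and score scale below are assumptions for illustration, not a published standard:

```python
# Hedged sketch of severity normalization: a tool's syntactic severity
# is scaled by analyst-supplied runtime context. Weights are assumptions.

def normalized_priority(base_severity, exploitability, business_impact):
    """Combine tool severity (0-10) with context ratings in [0, 1]."""
    context = 0.5 * exploitability + 0.5 * business_impact
    return base_severity * context

# A "critical" pattern in dead, internal-only code drops in priority...
print(normalized_priority(9.0, exploitability=0.0, business_impact=0.5))  # 2.25
# ...while a medium pattern on an exposed payment path keeps its weight.
print(normalized_priority(5.0, exploitability=1.0, business_impact=1.0))  # 5.0
```

The design choice is that the tool's verdict never acts alone: the same rule hit ranks differently depending on where it fires, which is what the reviewed studies mean by prioritizing on runtime context.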

Complementary tooling is repeatedly reported as essential for meaningful coverage. SAST is strongest for structural issues and early feedback, but several vulnerability classes depend on runtime configuration, deployed dependencies, or environment-specific behavior. Evidence describes improved detection breadth and fewer redundant alerts when SAST is paired with dependency and container scanning, complemented by targeted dynamic testing for high-risk services and endpoints; gaps repeatedly observed in incidents or DAST results are often translated into targeted custom rules or focused detector improvements.

Conclusion

SAST remains indispensable for early and scalable vulnerability discovery, but its problems are persistent and multi-dimensional. Empirical evidence shows low recall on realistic datasets and significant variation across tools, while industrial studies highlight that evaluation artifacts often fail to support actionable selection and improvement [10]. False positives drive alert fatigue and adoption failure; false negatives create residual risk and undermine the meaning of ‘passing’ a scan. Addressing these problems requires coordinated improvements: better evaluation assets, disciplined configuration and rule engineering, workflow-aware integration in CI/CD, and complementary testing layers. Future progress will likely depend as much on benchmarking design and human-centered engineering as on advances in static analysis algorithms.

Recommendations

Practical mitigation begins with acknowledging that SAST is not only a detection algorithm but also a workflow. Large-scale deployment experience recommends embedding scans into development processes so they are as non-disruptive as possible, while institutionalizing manual review. Confirmed false positives should be persistently suppressed so that review effort is paid once. Findings should be prioritized so that high-risk issues are addressed systematically, while lower-risk items are managed through backlog and periodic review [4].
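Persistent suppression requires identifying "the same" finding across scans even as line numbers shift. One common approach is to fingerprint findings on position-independent attributes; the sketch below is a generic illustration, and the attribute choice (rule id, file path, normalized snippet) is an assumption rather than any specific tool's scheme.

```python
import hashlib

# Sketch of persistent false-positive suppression: each finding is
# fingerprinted on stable attributes so a reviewed suppression survives
# line-number churn between scans. Attribute choice is an assumption.

def fingerprint(rule_id, file_path, snippet):
    material = f"{rule_id}|{file_path}|{snippet.strip()}"
    return hashlib.sha256(material.encode()).hexdigest()[:16]

suppressed = set()  # persisted alongside the code, reviewed like code

def triage(finding):
    fp = fingerprint(finding["rule"], finding["file"], finding["snippet"])
    return "suppressed" if fp in suppressed else "needs-review"

# Reviewed once, recorded permanently:
suppressed.add(fingerprint("weak-hash", "auth/util.py", "md5(salt)"))

# The same finding at a new line number (and with changed indentation)
# is still suppressed, because the fingerprint ignores position.
print(triage({"rule": "weak-hash", "file": "auth/util.py",
              "snippet": "  md5(salt)"}))  # suppressed
```

This is what "review effort is paid once" means operationally: the suppression is keyed to the finding's identity, not to a scan-specific location that churns with every commit.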

Finally, adoption improves when teams invest in usability and learning. Providing short internal “fix patterns” for recurring findings, mapping rules to common framework idioms, and offering a lightweight escalation path for questionable alerts reduces frustration and increases consistent remediation. For large organizations, a small security engineering function can curate shared configurations and suppression guidelines, while product teams retain autonomy over local exceptions. This balance preserves developer velocity while steadily improving detection quality over time.

 

References:

  1. Bakhshandeh, A., Keramatfar, A., Norouzi, A., & Chekidehkhoun, M. (2023). Using ChatGPT as a Static Application Security Testing Tool. *ISeCure: The ISC International Journal of Information Security*. https://arxiv.org/abs/2308.14434
  2. Bennett, G., Hall, T., Winter, E., Counsell, S. J., & Shippey, T. (2024). Do Developers Use Static Application Security Testing (SAST) Tools Straight Out of the Box? A Large-Scale Empirical Study. In *Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2024)*. https://doi.org/10.1145/3674805.3690750
  3. Bennett, G., Hall, T., Winter, E., & Counsell, S. J. (2024). Semgrep*: Improving the Limited Performance of Static Application Security Testing (SAST) Tools. In *Proceedings of the International Conference on Evaluation and Assessment in Software Engineering (EASE 2024)*. https://doi.org/10.1145/3661167.3661262
  4. Brucker, A. D., & Sodan, U. (2014). Deploying Static Application Security Testing on a Large Scale (Experience Report). SAP AG.
  5. Croft, R., Newlands, D., Chen, Z., & Babar, M. A. (2021). An Empirical Study of Rule-Based and Learning-Based Approaches for Static Application Security Testing. In *Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2021)*. https://doi.org/10.1145/3475716.3475781
  6. Dalaq, D., & Daya, K. F. (2025). A Systematic Literature Review on Static Application Security Testing (SAST) Tools: Evaluation, Benchmarks, Challenges, and Future Directions. In *Proceedings of the ACM Conference*. https://doi.org/10.1145/3727967.3756838
  7. Dencheva, L. (2022). Comparative analysis of Static application security testing (SAST) and Dynamic application security testing (DAST) by using open-source web application penetration testing tools (MSc internship report). National College of Ireland.
  8. Kuzmina, E., Chattha, S. P., Hosseini, S. E., Shahbaz, M., & Akhunzada, A. (2025). Spring Framework Benchmarking Utility for Static Application Security Testing (SAST) Tools. *IEEE Internet of Things Journal*, *12*(22). https://doi.org/10.1109/JIOT.2025.3598235
  9. Li, K., Chen, S., Fan, L., Feng, R., Liu, H., Liu, C., Liu, Y., & Chen, Y. (2023). Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java. In *Proceedings of ESEC/FSE 2023*. https://doi.org/10.1145/3611643.3616262
  10. Li, Y., Yao, P., Yu, K., Wang, C., Ye, Y., Li, S., Luo, M., Liu, Y., & Ren, K. (2025). Understanding Industry Perspectives of Static Application Security Testing (SAST) Evaluation. *Proceedings of the ACM on Software Engineering*, *2*. https://doi.org/10.1145/3729404
  11. Mateo Tudela, F., Bermejo Higuera, J.-R., Bermejo Higuera, J., Sicilia Montalvo, J.-A., & Argyros, M. I. (2020). On Combining Static, Dynamic and Interactive Analysis Security Testing Tools to Improve OWASP Top Ten Security Vulnerability Detection in Web Applications. *Applied Sciences*, *10*(24), 9119. https://doi.org/10.3390/app10249119