
A recent proof-of-concept study concluded that AI systems are promising for pneumothorax detection on chest radiographs but exhibit distinct diagnostic biases that must be carefully matched to the clinical context. Models with balanced performance may be suitable for general screening, whereas high-sensitivity models may better support triage workflows. The findings highlight that rigorous validation, integration strategies, and human supervision remain essential before deployment in real-world clinical practice. The study was published in December 2025 in the journal Cureus.

Introduction

Artificial intelligence (AI) systems for detecting pneumothorax continue to improve. They can be faster and more accurate than doctors reading radiographs. However, most AI tools work only with images, and they do not always generalize. For instance, one neural network model performed well on test images but not on real hospital images, which highlights the challenge of making AI work reliably across different settings. Newer AI tools on platforms such as Google Cloud Vertex AI perform well, even for small pneumothorax cases, and can provide a second opinion.

In addition to images, new AI models incorporate text, other data, and even audio. These models may assist with report writing and with answering radiology questions. A survey showed that while these models show potential, they also face challenges, including inadequate data, inaccuracies, and usability issues.

Study Overview

The authors conducted a preliminary comparative evaluation of two general-purpose multimodal models, GPT-4o and Gemini 2.5 Pro, for detecting pneumothorax on chest radiographs. The study aimed to benchmark the strengths and limitations of these models and to assess the clinical contexts in which each approach may be valuable. The evaluation involved 2,000 frontal chest radiographs, evenly divided between cases with and without clinically diagnosed pneumothorax.
Both models were queried with a standardized prompt: “Given a frontal chest radiograph, analyze and determine evidence of pneumothorax. Look for visible pleural line without lung markings, deep sulcus sign, lung translucency asymmetry, and collapse signs.” The models generated a binary prediction for each radiograph: 0 (no pneumothorax) or 1 (pneumothorax present). The predictions were compared with the ground-truth labels to generate confusion matrices showing true positives, true negatives, false positives, and false negatives. The performance metrics included accuracy, precision, recall, and F1 score. 95% confidence intervals were reported for accuracy and recall, and bootstrapping was used for F1 scores.

Key Findings

Table 1 shows the comparative performance of the GPT-4o and Gemini models for pneumothorax detection. The confusion matrix for GPT-4o highlights its more balanced performance, with moderate precision and specificity but a higher number of false negatives, which raises concerns for clinical triage scenarios where missed cases are critical. The confusion matrix for Gemini demonstrates substantially higher recall, indicating stronger sensitivity and an emphasis on minimizing missed cases of pneumothorax. Its fewer false negatives make it potentially more suitable for early screening contexts, although this comes at the cost of more false positives and lower precision than GPT-4o.
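The four metrics named in the methods follow directly from the confusion-matrix counts. The sketch below is a minimal illustration, using the GPT-4o counts reported in Table 1. The function names are hypothetical, and the bootstrap details (a percentile interval, 1,000 resamples, a fixed seed) are assumptions, since the summary does not say how the authors implemented bootstrapping.

```python
import random
from collections import Counter


def metrics_from_counts(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # F1 is the harmonic mean of precision and recall.
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


def bootstrap_f1_interval(tp, tn, fp, fn, n_resamples=1000, seed=0):
    """Percentile bootstrap 95% interval for F1 (assumed method, not the authors')."""
    rng = random.Random(seed)
    # Rebuild one outcome label per radiograph from the counts.
    outcomes = ["tp"] * tp + ["tn"] * tn + ["fp"] * fp + ["fn"] * fn
    scores = []
    for _ in range(n_resamples):
        # Resample radiographs with replacement and recompute F1.
        c = Counter(rng.choices(outcomes, k=len(outcomes)))
        scores.append(metrics_from_counts(c["tp"], c["tn"], c["fp"], c["fn"])["f1"])
    scores.sort()
    lower = scores[int(0.025 * n_resamples)]
    upper = scores[int(0.975 * n_resamples) - 1]
    return lower, upper


# GPT-4o counts as reported in Table 1.
gpt4o = metrics_from_counts(tp=571, tn=710, fp=290, fn=429)
f1_low, f1_high = bootstrap_f1_interval(571, 710, 290, 429)

for name, value in gpt4o.items():
    print(f"{name}: {value:.2f}")
print(f"F1 95% bootstrap interval: {f1_low:.3f} to {f1_high:.3f}")
```

Rounded to two decimals, the computed values reproduce the GPT-4o column of Table 1 (accuracy 0.64, precision 0.66, recall 0.57, F1 0.61). The interval bounds depend on the seed and the number of resamples, so they should be read as illustrative only.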
Table 1: Comparative performance of the GPT-4o and Gemini models for pneumothorax detection (2,000 chest X-rays: 1,000 pneumothorax and 1,000 non-pneumothorax)

| Performance metric | GPT-4o model | Gemini model |
| --- | --- | --- |
| Overall accuracy | 64% | 62% |
| Precision | 66% | 55% |
| Recall | 57% | 88% |
| F1 score | 61% | 68% |
| True positives (pneumothorax cases correctly flagged) | 571 | 879 |
| True negatives (non-pneumothorax cases correctly cleared) | 710 | 275 |
| False positives (non-pneumothorax cases flagged as pneumothorax) | 290 | 725 |
| False negatives (pneumothorax cases missed) | 429 | 121 |

Potential Clinical Implications

The authors highlighted the complementary strengths of GPT-4o and Gemini 2.5 Pro in pneumothorax detection. Pneumothorax is an urgent condition in which delayed recognition can lead to cardiorespiratory collapse, so early identification on chest radiographs, especially in emergency settings, is vital. AI assistance may expedite detection and reduce errors during high-workload periods. GPT-4o’s balanced performance seems suitable for general screening, whereas Gemini’s high sensitivity is suitable…