Introducing Garnet – an open-source, next-generation, faster cache-store for accelerating applications and services
https://www.microsoft.com/en-us/research/blog/introducing-garnet-an-open-source-next-generation-faster-cache-store-for-accelerating-applications-and-services/ | Mon, 18 Mar 2024 21:03:58 +0000
Garnet is a cache-store system that addresses growing demand for data storage to support interactive web applications and services. Offering several advantages over legacy cache-stores, Garnet is now available as an open-source download.

Researchers at Microsoft have been working for nearly a decade to address the increasing demand for data storage mechanisms to support the rapid advances in interactive web applications and services. Our new cache-store system called Garnet, which offers several advantages over legacy cache-stores, has been deployed in multiple use cases at Microsoft, such as those in the Windows & Web Experiences Platform, Azure Resource Manager, and Azure Resource Graph, and is now available as an open-source download at https://github.com/microsoft/garnet. In open sourcing Garnet, we hope to enable the developer community to benefit from its performance gains and capabilities, to build on our work, and to expand the Garnet ecosystem by adding new API calls and features. We also hope that the open sourcing will encourage follow-up academic research and open future collaboration opportunities in this important research area.

The cache-store problem

The growth of cloud and edge computing has brought an increasing number and range of applications and services that need to access, update, and transform data with higher efficiency, lower latencies, and lower costs than ever before. These applications and services often require significant operational spending on storage interactions, making this one of the most expensive and challenging platform areas today. A cache-store software layer, deployed as a separately scalable remote process, can ease these costs and improve application performance. This has fueled a growing cache-store industry, including many open-source systems, such as Redis, Memcached, KeyDB, and Dragonfly.

Unlike traditional remote cache-stores, which support a simple get/set interface, modern caches offer rich APIs and feature sets. They support raw strings, analytic data structures such as HyperLogLog, and complex data types such as sorted sets and hashes. They allow users to checkpoint and recover the cache, create data shards, maintain replicated copies, and support transactions and custom extensions.

However, existing systems achieve this feature richness at a cost: to keep the system design simple, they forgo fully exploiting the latest hardware capabilities (e.g., multiple cores, tiered storage, fast networks). Further, many of these systems are not explicitly designed to be easily extensible by app developers or to work well on diverse platforms and operating systems.

Introducing Garnet

At Microsoft Research, we have been investigating modern key-value database architectures since 2016. Our prior work, the FASTER embedded key-value library, which we open-sourced in 2018, demonstrated orders-of-magnitude better performance than existing systems, while focusing on the simple single-node in-process key-value model.

Starting in 2021, based on requirements from use cases at Microsoft, we began building a new remote cache-store with all the necessary features to serve as a viable replacement for existing cache-stores. Our challenge was to maintain and enhance the performance benefits that we achieved in our earlier work, but in this more general and realistic network setting.

The result of this effort is Garnet – a new cache-store that offers several unique benefits:

  • Garnet adopts the popular RESP wire protocol as a starting point, which makes it possible to use Garnet from unmodified Redis clients available in most programming languages today.
  • Garnet offers much better scalability and throughput with many client connections and small batches, leading to cost savings for large apps and services.
  • Garnet demonstrates better client latency at the 99th and 99.9th percentiles, which is critical to real-world scenarios.
  • Based on the latest .NET technology, Garnet is cross-platform, extensible, and modern. It is designed to be easy to develop for and evolve, without sacrificing performance in the common case. We leveraged the rich library ecosystem of .NET for API breadth, with open opportunities for optimization. Thanks to our careful use of .NET, Garnet achieves state-of-the-art performance on both Linux and Windows.

API features: Garnet supports a wide range of APIs including raw string, analytical, and object operations described earlier. It also implements a cluster mode with sharding, replication, and dynamic key migration. Garnet supports transactions in the form of client-side RESP transactions and our own server-side stored procedures in C# and allows users to define custom operations on both raw strings and new object types, all in the convenience of C#, leading to a lower bar for developing custom extensions.
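
Because Garnet speaks RESP, any unmodified Redis client can talk to it. Below is a minimal sketch using the redis-py client in Python; the host and port are assumptions for a locally running Garnet server (adjust them to your deployment), and the commands shown are standard RESP calls rather than Garnet-specific APIs.

    import redis

    # Connect to a running Garnet server. Host/port are illustrative assumptions;
    # use whatever address and port your GarnetServer instance listens on.
    client = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Raw string operations served by Garnet's main store.
    client.set("user:42:name", "Ada")
    print(client.get("user:42:name"))  # -> "Ada"

    # An object-store operation, e.g. a sorted set used as a leaderboard.
    client.zadd("leaderboard", {"ada": 120, "grace": 95})
    print(client.zrange("leaderboard", 0, -1, withscores=True))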

Network, storage, cluster features: Garnet uses a fast and pluggable network layer, enabling future extensions such as leveraging kernel-bypass stacks. It supports transport layer security (TLS) for secure communication, as well as basic access control. Garnet’s storage layer, called Tsavorite, was forked from OSS FASTER, and includes strong database features such as thread scalability, tiered storage support (memory, SSD, and cloud storage), fast non-blocking checkpointing, recovery, operation logging for durability, multi-key transaction support, and better memory management and reuse. Finally, Garnet supports a cluster mode of operation – more on this later.

Performance preview

We illustrate a few key results comparing Garnet to leading open-source cache-stores. A more detailed performance comparison can be found on our website at https://microsoft.github.io/garnet/.

We provision two Azure Standard F72s v2 virtual machines (72 vcpus, 144 GiB memory each) running Linux (Ubuntu 20.04), with accelerated TCP enabled. One machine runs different cache-store servers, and the other is dedicated to issuing workloads. We use our own benchmarking tool, called Resp.benchmark, to generate all results. We compare Garnet to the latest open-source versions of Redis (v7.2), KeyDB (v6.3.4), and Dragonfly (v6.2.11). We use a uniform random distribution of keys in these experiments (Garnet’s shared memory design benefits even more with skewed workloads). The data is pre-loaded onto each server, and fits in memory in these experiments.

Experiment 1: Throughput with varying number of client sessions

We start with large batches of GET operations (4096 requests per batch) and small payloads (8-byte keys and values) to minimize network overhead and compare the systems as we increase the number of client sessions. We see from Figure 1 that Garnet exhibits better scalability than Redis and KeyDB, while achieving higher throughput than all three baseline systems (the y-axis is log scale). Note that, while Dragonfly shows similar scaling behavior as Garnet, it is a pure in-memory system. Further, Garnet’s throughput relative to other systems remains strong when the database size (i.e., the number of distinct keys pre-loaded) is significantly larger, at 256 million keys, than what would fit in the processor caches.

Figure 1: Throughput (log-scale), varying number of client sessions, for a database size of (a) 1024 keys, and (b) 256 million keys

Experiment 2: Throughput with varying batch sizes

We next vary the batch size, with GET operations and a fixed number (64) of client sessions. We experiment with two different database sizes as before. Figure 2 shows that Garnet performs better even with no batching, and the gap increases even for very small batch sizes. Payload sizes are the same as before. Again, the y-axis is log scale.

Figure 2: Throughput (log-scale), varying batch sizes, for a database size of (a) 1024 keys, and (b) 256 million keys

Experiment 3: Latency with varying number of client sessions

We next measure client-side latencies for the various systems. Figure 3 shows that, as we increase the number of client sessions, Garnet’s latency (measured in microseconds) at various percentiles stays much more stable and lower as compared to other systems. Here, we issue a mix of 80% GET and 20% SET operations, with no operation batching.

Figure 3: Latency, varying number of client sessions, at (a) median, (b) 99th percentile, and (c) 99.9th percentile

Experiment 4: Latency with varying batch sizes 

Garnet’s latency is optimized for adaptive client-side batching and many sessions querying the system. We increase the batch sizes from 1 to 64 and plot latency at different percentiles below with 128 active client connections. We see in Figure 4 that Garnet’s latency is low across the board. As before, we issue a mix of 80% GET and 20% SET operations. 

Figure 4: Latency, varying batch sizes, at (a) median, (b) 99th percentile, and (c) 99.9th percentile

Other experiments

We have also experimented with other features and operation types and found Garnet to perform and scale well. Our documentation has more details, including how to run these experiments so that you can see the benefits for your own use cases.

Garnet’s design highlights

Garnet’s design re-thinks the entire cache-store stack – from receiving packets on the network, to parsing and processing database operations, to performing storage interactions. We build on top of years of research, with over 10 research papers published over the last decade. Figure 5 shows Garnet’s overall architecture. We highlight a few key ideas below.

Garnet’s network layer inherits a shared memory design inspired by our prior research on ShadowFax. TLS processing and storage interactions are performed on the IO completion thread, avoiding thread switching overheads in the common case. This approach allows CPU cache coherence to bring the data to the network, instead of traditional shuffle-based designs, which require data movement on the server.

Figure 5: Overall architecture of Garnet. Multiple network sessions pass through a parsing and API implementation layer; the storage API is translated into read, upsert, delete, and read-modify-write operations on the storage layer, which consists of a main store and an object store that both feed into a unified operations log. The log may be relayed to remote replicas.

Garnet’s storage design consists of two Tsavorite key-value stores whose fates are bound by a unified operation log. The first store, called the “main store,” is optimized for raw string operations and manages memory carefully to avoid garbage collection. The second, and optional, “object store” is optimized for complex objects and custom data types, including popular types such as Sorted Set, Set, Hash, List, and Geo. Data types in the object store leverage the .NET library ecosystem for their current implementations. They are stored on the heap in memory (which makes updates very efficient) and in a serialized form on disk. In the future, we plan to investigate using a unified index and log to ease maintenance.

A distinguishing feature of Garnet’s design is its narrow-waist Tsavorite storage API, which is used to implement the large, rich, and extensible RESP API surface on top. This API consists of read, upsert, delete, and atomic read-modify-write operations, implemented with asynchronous callbacks for Garnet to interject logic at various points during each operation. Our storage API model allows us to cleanly separate Garnet’s parsing and query processing concerns from storage details such as concurrency, storage tiering, and checkpointing. 
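
To make the narrow-waist idea concrete, here is a minimal illustrative sketch in Python (not Garnet's actual C# interfaces; all names are hypothetical) showing how a richer command such as INCR can be expressed purely in terms of a four-operation storage API with a read-modify-write callback:

    class NarrowStore:
        """Toy key-value store exposing only read, upsert, delete, and read-modify-write."""

        def __init__(self):
            self._data = {}

        def read(self, key):
            return self._data.get(key)

        def upsert(self, key, value):
            self._data[key] = value

        def delete(self, key):
            self._data.pop(key, None)

        def rmw(self, key, initial, update):
            # Create the value if absent, otherwise transform it in place, and return it.
            self._data[key] = update(self._data[key]) if key in self._data else initial()
            return self._data[key]


    def incr(store, key):
        """An INCR-style command implemented entirely on top of the narrow API."""
        return store.rmw(key, initial=lambda: 1, update=lambda v: v + 1)


    store = NarrowStore()
    incr(store, "counter")         # -> 1
    print(incr(store, "counter"))  # -> 2

In this framing, the parsing layer only decides which of the four operations a command maps to; concerns such as concurrency, storage tiering, and checkpointing stay hidden behind the storage interface.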

Garnet further adds support for multi-key transactions based on two-phase locking. One can either use RESP client-side transactions (MULTI/EXEC) or use our server-side transactional stored procedures in C#.
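
Client-side RESP transactions work with standard Redis clients as well. As a hedged sketch, redis-py wraps queued commands in MULTI/EXEC when a pipeline is created with transaction=True (connection settings are again illustrative):

    import redis

    client = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # transaction=True makes the pipeline issue MULTI ... EXEC around the queued commands.
    with client.pipeline(transaction=True) as tx:
        tx.set("account:alice", 100)
        tx.set("account:bob", 50)
        tx.incrby("account:alice", -10)
        tx.incrby("account:bob", 10)
        results = tx.execute()  # all four commands are applied together

    print(results)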

Cluster mode

In addition to single-node execution, Garnet supports a cluster mode, which allows users to create and manage a sharded and replicated deployment. Garnet also supports an efficient and dynamic key migration scheme to rebalance shards. Users can use standard Redis cluster commands to create and manage Garnet clusters, and nodes perform gossip to share and evolve cluster state. Overall, Garnet’s cluster mode is a large and evolving feature, and we will cover more details in subsequent posts.

Looking ahead

As Garnet is deployed in additional scenarios, we will continue to share those details in future articles. We also look forward to continuing to add new features and improvements to Garnet, as well as working with the open-source community.

Project contributors

Garnet Core: Badrish Chandramouli, Vasileios Zois, Lukas Maas, Ted Hart, Gabriela Martinez Sanchez, Yoganand Rajasekaran, Tal Zaccai, Darren Gehring, Irina Spiridonova

Collaborators: Alan Yang, Pradeep Yadav, Alex Dubinkov, Venugopal Latchupatulla, Knut Magne Risvik, Sarah Williamson, Narayanan Subramanian, Saurabh Singh, Padmanabh Gupta, Sajjad Rahnama, Reuben Bond, Rafah Hosn, Surajit Chaudhuri, Johannes Gehrke, and many others.

Exploring how context, culture, and character matter in avatar research
https://www.microsoft.com/en-us/research/blog/exploring-how-context-culture-and-character-matter-in-avatar-research/ | Mon, 18 Mar 2024 16:00:00 +0000
As avatar use expands in digital spaces, advances are required to better represent all people. Discover how research into the varying perceptions of facial animation glitches in low versus high realism scenarios supports this goal.

This research paper was presented at the IEEE VR Workshop Series on Animation in Virtual and Augmented Environments (ANIVAE 2024), the premier series on 3D content creation for simulated training in extended reality.


Face-to-face communication is changing, moving beyond physical interaction to include video conferencing and AR/VR platforms, where the participants are represented by avatars. Sophisticated avatars, animated through motion tracking, can realistically portray their human counterparts, but they can also suffer from noise, such as jitter and distortion, reducing their realism. Advances in motion-capture technology aim to reduce such issues, but they come with higher development costs and require additional time due to the need for advanced components. While some noise is inevitable, it’s important to determine acceptable types and levels to efficiently develop and introduce AR/VR devices and avatars to the market. Additionally, understanding how noise impacts avatar-based communication is essential for creating more inclusive avatars that accurately represent diverse cultures and abilities, enhancing the user experience.

In our paper, “Ecological Validity and the Evaluation of Avatar Facial Animation Noise,” presented at ANIVAE 2024, we explore the challenge of evaluating avatar noise without a standardized approach. Traditional methods, which present participants with isolated facial animation noise to gauge perception thresholds, fall short of reflecting real-life avatar interactions. Our approach emphasizes ecological validity—the extent to which experiments mimic real-world conditions—as central in assessing avatar noise. We discovered this significantly influences participants’ response to avatars, highlighting the impact of context on noise perception. Our goal is to improve avatar acceptance, inclusivity, and communication by developing noise evaluation methods that better represent actual experiences. 

Seeing the big picture  

To set up our study, we animated two avatars using motion capture, as depicted in Figure 1 (A). We recorded the performance of two professional actors enacting a scene between an architect and a client discussing home renovations and examining a 3D model of the proposed design. We used two proprietary characters for the avatars, whose faces were animated with 91 expression blendshapes. This allowed for a broad range of facial expressions and subtle variations in emotions, contributing to a more realistic animation. To examine different dynamics, we created six variations of the scene, changing the characters’ gender, role, and whether they agreed on the renovation plan.

Figure 1: A. Motion capture of a social interaction scenario for the experiment. B. The motion capture was remapped to stylized avatars. C. Participants experienced the scene wearing a HoloLens 2 and responded to questions on a tablet app. D. The avatars’ facial features were degraded with different types of animation noises of varying severity.

Fifty-six participants engaged in two experiments to evaluate the impact of noise on avatar facial animation. The first experiment had low ecological validity. Participants viewed fragmented clips of dialogue through a Microsoft HoloLens 2 device and used a slider to adjust any noise to an acceptable level. The second experiment featured high ecological validity, showing the scene in its full social context. Here, participants used a HoloLens 2 to judge the noise in facial expressions as either “appropriate” or “inappropriate” for the conversation. In contrast to the first experiment, this method considered the social aspects of context, culture, and character. 

Results indicate that noise was less distracting when participants viewed the scene in its entirety, revealing a greater tolerance for noise in high ecological validity scenarios. Isolated clips, on the other hand, led to greater annoyance with facial animation noise, suggesting the importance of social context over hyper-realistic animation. 

Cultural observations showed that noise perception was influenced by implicit cultural norms, particularly around gender roles and agreement levels. For example, in the second experiment, where participants viewed the conversation within its greater social context (high ecological validity), noise was deemed “appropriate” when the female architect agreed with the male client and “inappropriate” when she disagreed, revealing potential gender biases not observed in reversed gender roles. These findings emphasize the importance of applying high ecological validity in experiments to uncover socio-cultural influences on avatar perception. They also underscore the need to carefully consider context and cultural dynamics in avatar design. 

Finally, we explored the character trait of empathy. Participants with lower empathy scores were more critical of noise in context-rich scenarios. This indicates that experiments focusing solely on low ecological validity might overlook important insights on how empathy influences responses to avatar facial animation noise.

Avatars need to be studied in realistic situations 

When people communicate, they engage in a complex process influenced by environment, cultural background, and the nonverbal cues they perceive and interpret. By prioritizing high ecological validity in studies on avatar perception, researchers can uncover these socio-cultural influences and trust that their findings are relevant and applicable to real-life interactions within digital spaces. 

Our research examines how different combinations of demographic characteristics change the way people react to avatars, and we hope to encourage more inclusivity in avatar design. It’s essential to have an established set of guidelines to achieve this goal, and this work is one step in that direction. While our study’s scope is limited, its methodology can be applied broadly across different devices and settings.

Acknowledgements

We would like to thank Ken Jakubzak, James Clemoes, Cornelia Treptow, Michaela Porubanova, Kerry Read, Daniel McDuff, Marina Kuznetsova and Mathew Lamb for their research collaboration. We would also like to thank Shawn Bruner for providing the characters for the study and Panagiotis Giannakopoulos for leading the animation and motion capture pipelines.

Scaling early detection of esophageal cancer with AI
https://www.microsoft.com/en-us/research/blog/scaling-early-detection-of-esophageal-cancer-with-ai/ | Mon, 11 Mar 2024 10:00:00 +0000
Microsoft Research and Cyted have collaborated to build novel AI models to scale the early detection of esophageal cancer. The AI-supported methods demonstrated the same diagnostic performance as the existing manual workflow, potentially reducing the pathologist's workload by up to 63%.

Microsoft Research and Cyted have collaborated to build novel AI models to scale the early detection of esophageal cancer. The AI-supported methods demonstrated the same diagnostic performance as the existing manual workflow, potentially reducing the pathologist's workload by up to 63%.

Esophageal cancer is the sixth most common cause of cancer deaths worldwide, in part because this disease is typically diagnosed late, making treatment difficult. Fewer than 1 in 5 patients survive five years after diagnosis, making early detection of this disease critical to improving a patient’s chances. One opportunity for early detection is to identify patients with a condition called Barrett’s esophagus (BE). Patients with BE are at an increased risk of developing cancer, though most never will. Chronic heartburn is a risk factor and a possible cause of Barrett’s.

Detecting BE dramatically improves a patient's chances. Earlier detection of cancer and earlier start of treatment mean that more than 9 in 10 patients survive 5 years after diagnosis. However, early detection of BE has typically involved an endoscopic biopsy, a procedure that many people find uncomfortable and invasive. It often requires sedation, is resource intensive, and increases the risk of complications.

A major step toward enabling large-scale screening for BE has been spearheaded by Cyted, a start-up company at the forefront of medical innovation. Cyted has developed a capsule sponge device called EndoSign® – a dissolvable capsule on a string that expands into a small medical sponge once in the stomach. When pulled back out, it collects cells from the lining of the esophagus, which are then processed, placed on slides, stained, and scanned for digital analysis.

The capsule sponge is easier to administer and less costly than endoscopy. But a pathologist still needs to review the digitized slides to determine the presence of any goblet cells, a type of cell normally found in the intestinal lining, which would indicate BE if found in the esophagus. These images are huge (up to 100,000 by 100,000 pixels – the size of a squash court if printed at the typical photo resolution of 300dpi) – yet may contain only a few goblet cells per image, each cell just a few pixels large. To identify BE, pathologists need to use slides from two stains, H&E (a routine stain for observing cell structure) and TFF3 (a special stain used just to find goblet cells). Since most patients with heartburn will not have BE, pathologists spend most of their time examining negative cases; without more sophisticated approaches to analysis, this takes away time they could spend prioritizing high-risk cases.

Microsoft Research and Cyted have collaborated to build novel AI models that can efficiently check the slides for goblet cells, using either the H&E or TFF3 stains. This joint effort has led to a Nature Communications paper titled “Enabling large-scale screening of Barrett’s esophagus using weakly supervised deep learning in histopathology.” Our study uses the strength of transformer-based multiple instance learning to assist in the screening of BE. In the paper, we introduce two major innovations. First, we show that the AI models can be built solely from the pathologists’ findings on whether BE is present, eliminating the need for expensive pixel-level annotations. This means that existing large capsule sponge screening datasets can be used to further improve the performance of the model. Secondly, we demonstrate that goblet cells can be detected with high accuracy using only the H&E slides. This is the most common routine stain in pathology, and it suggests that the more time-consuming and costly specialized staining, TFF3, could be skipped (see Figure 2 below).
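
For readers unfamiliar with the setup, the sketch below shows a generic attention-based multiple instance learning model in PyTorch: tile embeddings from a slide are pooled with learned attention and supervised only by the slide-level label. This is an illustrative stand-in, not the model from the paper, and the dimensions and module choices are assumptions.

    import torch
    import torch.nn as nn

    class SlideMIL(nn.Module):
        """Weakly supervised MIL over tile embeddings: one label per slide, no tile annotations."""

        def __init__(self, tile_dim=768, hidden=256):
            super().__init__()
            self.project = nn.Linear(tile_dim, hidden)
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
                num_layers=2,
            )
            self.attn = nn.Linear(hidden, 1)      # attention scores highlight informative tiles
            self.classify = nn.Linear(hidden, 1)  # slide-level logit (BE vs. not BE)

        def forward(self, tiles):                 # tiles: (batch, n_tiles, tile_dim)
            h = self.encoder(self.project(tiles))
            weights = torch.softmax(self.attn(h), dim=1)  # (batch, n_tiles, 1)
            slide_repr = (weights * h).sum(dim=1)         # attention-weighted pooling
            return self.classify(slide_repr), weights

    # Toy usage: one slide represented by 500 tile embeddings from a pretrained image encoder.
    model = SlideMIL()
    logit, attention = model(torch.randn(1, 500, 768))
    loss = nn.BCEWithLogitsLoss()(logit.squeeze(1), torch.tensor([1.0]))  # slide-level label only
    loss.backward()

In a setup like this, the attention weights play the role of the attention maps in Figure 1: the regions contributing most to the slide-level prediction can be inspected to check that they contain goblet cells.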

Figure 1: The top-left contains a thumbnail image of an H&E slide with goblet cells. In the bottom left, the attention maps of the AI model show which image regions the model uses to make its final prediction. Zooming in to those areas (bottom right), we see that image parts that receive high attention contain goblet cells. We validate that these are indeed goblet cells by looking at the corresponding TFF3 slide (top right), where goblet cells are shown as brown.

In the paper, we further discuss different AI-assisted workflows designed to optimize the screening process. The first workflow necessitates a pathologist’s review only if either the H&E or TFF3 models predict a sample as positive. This method can achieve the same diagnostic performance as the existing manual workflow in terms of sensitivity and specificity, potentially reducing the pathologist’s workload by 52% (see Figure 3 below).

The second proposed workflow reduces the pathologist's review load by 63% relative to the original, by restricting reviews to positive predictions from the H&E model only. However, this comes at slightly reduced sensitivity, since goblet cells are more clearly visible in the TFF3 stain.

Figure 2: Proposed AI-assisted workflows. a) Workflow “Pathologist reviews any positives” b) Workflow “Pathologist reviews H&E model positives”
Proposed AI-assisted workflow | Pathologist review (% of all cases) | TFF3 staining required (% of all cases) | Sensitivity @ Specificity 1.00
Pathologist reviews any positive | 48% | 100% | 1.00
Pathologist reviews H&E model positives | 37% | 37% | 0.91
Figure 3: Quantitative comparison of the proposed workflows. For the two workflows described in Figure 2, we compare the pathologist workload as a fraction of the total number of cases, the number of images for which a costly TFF3 stain is required, and the resulting accuracy numbers.
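
The triage logic of the two workflows can be restated schematically in a few lines of Python; this is a paraphrase of Figure 2 with hypothetical model outputs, not code from the study.

    def needs_pathologist_review(he_positive, tff3_positive, workflow):
        """Schematic triage rules for the two AI-assisted workflows."""
        if workflow == "review_any_positive":
            # Workflow (a): review if either the H&E or the TFF3 model flags the case.
            return he_positive or tff3_positive
        if workflow == "review_he_positives":
            # Workflow (b): TFF3 staining and review happen only for H&E-model positives.
            return he_positive
        raise ValueError(f"unknown workflow: {workflow}")

    # An H&E-negative, TFF3-positive case is reviewed under (a) but missed under (b),
    # which is where the small sensitivity difference between the workflows comes from.
    print(needs_pathologist_review(False, True, "review_any_positive"))  # True
    print(needs_pathologist_review(False, True, "review_he_positives"))  # False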

Our collaboration with Cyted demonstrates the transformative potential of integrating advanced AI models into clinical workflows, saving valuable time for pathologists. As we move forward, the scalability of this technology holds the promise for widespread adoption of early detection in the fight against esophageal cancer.

“This represents a significant step in our fight against esophageal cancer, offering the potential to save countless lives through early detection with our minimally-invasive capsule sponge technology,” said Cyted CEO Marcel Gehrung. “Our collaboration with Microsoft Research has been instrumental in pushing the boundaries of what’s possible in medical imaging and screening technologies, creating optimal efficiencies from start to finish of the testing process.”

We have open sourced code to build these models, which is designed to be scalable to very large datasets, using Azure Machine Learning. This flexibility allows other researchers and institutions to adapt and enhance our code according to their specific needs. Importantly, our code represents a significant advancement over previous work in the field. Unlike earlier approaches that focused solely on training the multiple instance and attention layers, our code allows for end-to-end fine-tuning, including the image encoder. This comprehensive approach to training ensures optimal performance and accuracy, setting a new standard for AI models in histopathology.

“The open sourcing of this code has helped us to advance our research in the field of early cancer detection,” said Florian Markowetz, Professor of Computational Oncology at the University of Cambridge, and Senior Group Leader at Cancer Research UK Cambridge Institute. “Several key features will soon be integrated into ongoing clinical trials, where we aim to improve the detection of Barrett’s esophagus in patients and ultimately treat more cancers through early intervention. Furthermore, these features will help improve the workflow of pathologists and identify key regions quicker, enabling clinicians to tackle more cases with greater reliability.”

By sharing our work, we aim not only to enhance the detection of BE and esophageal cancer, but also to empower researchers and clinicians around the world to leverage this technology in their fight against cancer[1]. Because our code can be used as a building block to develop AI models for histopathology slides, it may also potentially be applied to other cancer types. It is our hope that this open-source initiative will foster innovation and collaboration, and ultimately lead to breakthroughs that save lives.

As researchers, it has been exciting to work closely with Cyted and be part of the long path towards early detection of esophageal cancer. Cross-discipline collaborations like this are excellent opportunities to solve complex clinical problems. With AI models built using the principles of responsible AI like fairness, privacy and security, and reliability and safety, we can ultimately make a tangible difference to patient outcomes.

Acknowledgement

Thank you to the team: Kenza Bouzid, Harshita Sharma, Sarah Killcoyne, Daniel C. Castro, Anton Schwaighofer, Max Ilse, Valentina Salvatelli, Ozan Oktay, Sumanth Murthy, Lucas Bordeaux, Luiza Moore, Maria O’Donovan, Anja Thieme, Hannah Richardson, Aditya Nori, Marcel Gehrung, Javier Alvarez-Valle


[1] Code released for research use only. Full disclaimer here: https://github.com/microsoft/be-trans-mil

Improving LLM understanding of structured data and exploring advanced prompting methods
https://www.microsoft.com/en-us/research/blog/improving-llm-understanding-of-structured-data-and-exploring-advanced-prompting-methods/ | Thu, 07 Mar 2024 17:00:00 +0000
Structural Understanding Capabilities is a new benchmark for evaluating and improving LLM comprehension of structured table data. This advance can help LLMs process and analyze data more effectively, broadening their applicability in real-world tasks.

This research paper was presented at the 17th ACM International Conference on Web Search and Data Mining (WSDM 2024), the premier conference on web-inspired research on search and data mining.


In today’s data-driven landscape, tables are indispensable for organizing and presenting information, particularly text. They streamline repetitive content, enhance data manageability, enable easier data analysis, and improve machine processing capabilities. Meanwhile, large language models (LLMs) are advancing in their ability to tackle challenges associated with natural language, but the degree to which they understand tables included in their prompts remains an open question. Our research aims to explore this question and improve how LLMs use and work with table-based data.

Our paper, “Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study,” presented at WSDM 2024, investigates what kinds of prompts most effectively enable LLMs to understand tables; how much LLMs inherently detect structured data; and how LLMs’ existing knowledge can be harnessed to improve this understanding. We also analyze the complex trade-off among multiple combinations of input designs and overall performance.

To address these questions, we propose a new benchmark called Structural Understanding Capabilities (SUC), shown in Figure 1 (a), which focuses on specific tasks to assess LLMs’ ability to understand structured data in tables and compare different types of prompts. We conducted a series of experiments using different prompt designs. Our findings, detailed in the paper, evaluate how each design enhances LLMs’ ability to work with tables. 

Figure 1. The SUC benchmark and prompt designs for evaluation. (a) The benchmark organizes capabilities into two stages: partition & parsing (structural description detection, format understanding, and hierarchy detection, assessed via table partition, table size detection, and hierarchy detection tasks) and search & retrieval (grounding/locating and operation reasoning, assessed via cell lookup, reverse lookup, and column and row retrieval tasks). (b) Input designs include partition marks, serialization, role prompting, order permutation, and format explanation, applied across markup languages such as HTML, XML, and Markdown.

Insights and findings using the SUC benchmark

Based on humans’ perception of tables, we developed tasks to evaluate how LLMs understand them. We conducted evaluations on GPT-3.5 and GPT-4 and discovered that the results depended on certain input factors, such as table format, content order, and partition marks. The results, detailed in Tables 1 and 2, include some notable and unexpected findings:

  • Delimiter-separated formats (e.g., CSV, TSV) underperformed compared with HTML by 6.76 percent.
  • Using HTML and few-shot learning consistently improved performance. The effectiveness of other approaches, such as format explanation, role prompting, order change, and partition marks, varied depending on task difficulty and the required capacity.
  • Despite the simplicity of the benchmark tasks, the highest overall accuracy across seven tasks is only 65.43 percent. This underscores the need for LLMs to have better awareness of table structures and highlights areas for further improvement in table serialization.

Our exploration suggests that:

  • LLMs have a basic understanding of table structures but are far from perfect, even in straightforward tasks like detecting the number of columns and rows.
  • Choosing the right combination of input designs can significantly enhance LLMs’ understanding of structured data.

Our findings revealed significant performance gaps in downstream tasks, attributed to the different combinations of serialization functions and input options. These gaps remained even with GPT-4, underscoring the effectiveness of our benchmark approach.
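
As a concrete illustration of what these serialization choices look like in a prompt, the following sketch uses pandas (one possible tool, not the benchmark's own code) to render the same small table as delimiter-separated text, Markdown, and HTML before embedding it in a question:

    import pandas as pd

    table = pd.DataFrame({"Year": [1983, 1989], "Races": [2, 2], "Pos": ["29th", "7th"]})

    # Three serializations of the same table; in the SUC experiments,
    # HTML markup tended to be understood better than delimiter-separated text.
    csv_text = table.to_csv(index=False)      # delimiter-separated (CSV/TSV-like)
    md_text = table.to_markdown(index=False)  # Markdown (requires the 'tabulate' package)
    html_text = table.to_html(index=False)    # HTML markup

    prompt = (
        "Answer using only the table below.\n"
        f"{html_text}\n"
        "Question: In which year was the position 7th?"
    )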

Table 1. SUC benchmark evaluations on table formats (NL + Sep, Markdown, JSON, XML, and HTML) across the seven benchmark tasks, comparing GPT-4 with earlier models.
Table 2. Ablation study of input designs (format explanation, partition mark, role prompting, order change, and 1-shot learning) using the SUC benchmark, showing the change in GPT-4 accuracy on each task when each component is removed.

Improved performance with self-augmented prompting

Based on these benchmark evaluations, we investigated how LLMs’ existing knowledge could be used to enhance their understanding of structured data. To do this, we introduced self-augmentation, a model-agnostic technique that improves structural prompting—enabling LLMs to identify key values and ranges by tapping into their own internal knowledge. This technique simplifies and optimizes how LLMs utilize their existing knowledge base to improve their understanding of structured content, allowing them to generate intermediate structural insights. This process is shown in Figure 2, with the results detailed in Table 3.
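
In code, self-augmented prompting is simply a two-request pattern: the first call asks the model to describe critical values and structural information in the table, and the second call appends that intermediate output to the task prompt. The sketch below is schematic; call_llm is a hypothetical placeholder for whatever chat-completion client you use.

    def call_llm(prompt):
        """Placeholder for your chat-completion client of choice."""
        raise NotImplementedError

    def self_augmented_answer(table_text, question):
        # Request 1: let the model surface its own structural insights about the table.
        insight = call_llm(
            "Identify the critical values, ranges, and structural information "
            f"of the following table:\n{table_text}"
        )
        # Request 2: reuse that intermediate output as extra context for the actual task.
        return call_llm(
            f"Table:\n{table_text}\n\nStructural notes:\n{insight}\n\nQuestion: {question}"
        )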

Figure 2. Self-augmented prompting: a first request asks the LLM to identify critical values and ranges of the table, the intermediate output summarizing those insights is then combined with the table in a second request, and the model produces the final answer.
Table 3. Evaluation of downstream tasks (TabFact, HybridQA, SQA, Feverous, and ToTTo), comparing 1-shot prompting with self-augmented prompting variants. “SA” refers to self-augmented prompting.

Looking forward

Our study sets a key benchmark in expanding the capabilities of LLMs to better understand structured table data, moving beyond conventional natural language processing tasks. We suggest future research should prioritize the integration of structural information to improve performance with various structured data types. Additionally, we propose exploring LLMs’ ability to use external tools or agents for improved handling of structured data, opening new avenues for application.

Research Forum Episode 2: Transforming health care and the natural sciences, AI and society, and the evolution of foundational AI technologies
https://www.microsoft.com/en-us/research/blog/research-forum-episode-2-transforming-health-care-and-the-natural-sciences-ai-and-society-and-the-evolution-of-foundational-ai-technologies/ | Wed, 06 Mar 2024 20:12:15 +0000
Research advances are driving real-world impact faster than ever. Episode 2 of Microsoft Research Forum explores how AI is transforming health care and the natural sciences, the intersection of AI and society, and the evolution of foundational AI technologies.

Research advances are driving real-world impact faster than ever. Recent developments in AI are reshaping the way people live, work, and think. In the latest episode of Microsoft Research Forum, we explore how AI is transforming health care and the natural sciences, the intersection of AI and society, and the continuing evolution of foundational AI technologies.

Below is a brief recap of the event, including select quotes from the presentations. Full replays of each session and presentation will be available soon.

Keynote: The Revolution in Scientific Discovery

Chris Bishop, Technical Fellow and Director, Microsoft Research AI4Science 

As in our debut event on January 30, this edition of Research Forum began with a keynote address by a leader from Microsoft Research. Chris Bishop shared some exciting real-world progress being made by his team toward modelling and predicting natural phenomena.

Chris Bishop: “In my view, the most important use case of AI will be to scientific discovery. And the reason I believe this is that it’s our understanding of the natural world obtained through scientific discovery, together with its application in the form of technology that has really transformed the human species.”

Panel discussion: Transforming the natural sciences with AI

Bonnie Kruft, Partner Deputy Director, Microsoft Research AI4Science (Host)
Rianne van den Berg, Principal Research Manager, Microsoft Research AI4Science 
Tian Xie, Principal Research Manager, Microsoft Research AI4Science 
Tristan Naumann, Principal Researcher, Microsoft Research Health Futures 
Kristen Severson, Senior Researcher, Microsoft Research New England 
Alex Lu, Senior Researcher, Microsoft Research New England

In a discussion hosted by Bonnie Kruft, Microsoft researchers presented their latest advancements in the fields of foundation models, drug discovery, material design, and machine learning. Panelists highlighted deep learning’s growing impact on the natural sciences.

Tristan Naumann: “Much of the data we have in healthcare is not nicely structured in a clean and easy to use way. And so, one of the things that’s really incredible about some of these recent advances in generative AI, specifically large language models (LLMs) and multimodal models, is really this opportunity to have a tool for universal structuring. And unlocking some of that data quickly and efficiently really opens up a lot of new opportunities.”

Tian Xie: “Similar (to) the field of health and in biology, machine learning is really beginning to interrupt some of the traditional pipelines that happened in materials discovery.”

Kristen Severson: “We have a lot of knowledge about diseases and how they manifest and we don’t want to leave that information on the table when we train a machine learning model. So, there’s not an interest in using solely black box approaches, but instead (in) using what’s already known.”

Alex Lu: “If you look at what particularly differentiates biology and I suspect by extension a lot of other scientific disciplines, the whole point is to try to discover something new. So, by definition, what that new thing is is not going to be captured in your original distribution of data.” 

Rianne van den Berg: “One particular class of generative models that I’m very excited about and that’s becoming increasingly popular is that of diffusion models and score-based generative models. These models have been super successful already, for instance in high resolution image generation and video, and they’re also very naturally suited to target scientific discovery.” 

Lightning talk: What’s new in AutoGen? 

Chi Wang, Principal Researcher, Microsoft Research AI Frontiers 

Chi Wang presented the latest updates on AutoGen – the multi-agent framework for next generation AI applications. The discussion covered milestones achieved, community feedback, exciting new features, and the research and related challenges on the road ahead. He also announced a recent milestone. 

Chi Wang: “Our initial multiagent experiment on the challenging GAIA benchmark turned out to achieve the number one accuracy in the leaderboard in all three levels. That shows the power of AutoGen in solving complex tasks and big potential.”

Lightning talk: The metacognitive demands and opportunities of generative AI

Lev Tankelevitch, Senior Behavioral Science Researcher, Microsoft Research Cambridge (UK)

Lev Tankelevitch explored how metacognition—the psychological capacity to monitor and regulate one’s thoughts and behaviors—provides a valuable lens for understanding and addressing the usability challenges of generative AI systems. This includes prompting, assessing and relying on outputs, and workflow optimization, which require a high degree of metacognitive monitoring and control.

Lev Tankelevitch: “We believe that a metacognitive perspective can help us analyze, measure, and evaluate the usability challenges of generative AI, and it can help us design generative AI systems that can augment human agency and workflows.”

Lightning talk: Getting modular with language models: Building and reusing a library of experts for task generalization

Alessandro Sordoni, Principal Researcher, Microsoft Research Montreal

Alessandro Sordoni discussed recent research on building and re-using large collections of expert language models to improve zero-shot and few-shot generalization to unseen tasks.

Alessandro Sordoni: “Looking forward, I believe that an exciting direction would be to push this to fully decentralized training and continual improvement of language models in the sense that users can train their experts, then share them in the platform and the model gets better.” 

Lightning talk: GigaPath: Real-World Pathology Foundation Model

Naoto Usuyama, Principal Researcher, Microsoft Research Health Futures

Naoto Usuyama presented GigaPath, a novel approach for training large vision transformers for gigapixel pathology images, utilizing a diverse, real-world cancer patient dataset, with the goal of laying a foundation for cancer pathology AI.

Naoto Usuyama: “This project (GigaPath) is not possible without many, many collaborators, and we are just scratching the surface. So, I’m very excited, and I really hope we can unlock the full potential of real-world patient data and advanced AI for cancer care and research.”

Lightning talk: Generative AI and plural governance: Mitigating challenges and surfacing opportunities

Madeleine Daepp, Senior Researcher, Microsoft Research Redmond
Vanessa Gathecha, Applied Researcher and Policy Analyst, Baraza Media Lab

This talk featured two expert speakers. Madeleine Daepp discussed the potential impacts and challenges of generative AI in a year with over 70 major global elections. Vanessa Gathecha, a 2024 Microsoft AI and Society fellow, discussed her work on disinformation in Kenya and Sub-Saharan Africa.

Madeleine Daepp: “The disruption of our digital public sphere is an all-of-society problem that requires an all-of-society response. The AI and Society fellows program is helping to build much needed connections across places, across academic disciplines, and across societal sectors to help us understand the problem and work toward an impactful response.” 

Research Focus: Week of March 4, 2024
https://www.microsoft.com/en-us/research/blog/research-focus-week-of-march-4-2024/ | Wed, 06 Mar 2024 17:00:00 +0000
In this issue: Generative kaleidoscopic networks; Text diffusion with reinforced conditioning; PRISE – Learning temporal action abstractions as a sequence compression problem.

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus 
Week of March 4, 2024

Generative Kaleidoscopic Networks

Neural networks are deep learning models that can be trained to learn complex patterns and relationships within data. In a recent paper: Generative Kaleidoscopic Networks, researchers from Microsoft detail how they discovered an “over-generalization” phenomenon, which indicates that the neural networks tend to learn many-to-one mappings. They then use this phenomenon to introduce a new paradigm of generative modeling by creating a dataset kaleidoscope, dubbed ‘Generative Kaleidoscopic Networks.’ The researchers are exploring theoretical explanations, experiments on multimodal data, and conditional generation using the Generative Kaleidoscopic Networks.

MNIST Kaleidoscope: Manifold learning is done on the MNIST data images with a Multilayer Perceptron model. We start with an input noise vector sampled from a uniform distribution and run the kaleidoscopic sampling algorithm. The transitions between images demonstrate a kaleidoscopic effect until the samples eventually find a stable minimum and converge at a digit.

Text Diffusion with Reinforced Conditioning

Diffusion models are a type of machine learning model that have shown exceptional ability to generate high-quality images, videos, and audio. Due to their adaptiveness in iterative refinement, they offer potential for achieving better non-autoregressive sequence generation—which simultaneously predicts all elements of a sequence, rather than predicting the next element in a sequence.

However, existing text diffusion models have yet to fulfill this potential, due to challenges in handling the discreteness of language. In a recent paper: Text Diffusion with Reinforced Conditioning, researchers from Microsoft and external colleagues uncover two significant limitations in text diffusion models: degradation of self-conditioning during training and misalignment between training and sampling. In response, the researchers propose a novel model called TREC, which empowers text diffusion models with reinforced conditioning, mitigating the degradation by directly motivating quality improvements from self-conditions with reward signals. In the paper, which was presented at the 2024 Association for the Advancement of Artificial Intelligence conference (AAAI), they further propose time-aware variance scaling to address the misalignment issue.

Extensive experiments demonstrate the competitiveness of TREC against autoregressive, non-autoregressive, and diffusion baselines. Moreover, qualitative analysis shows its advanced ability to fully utilize the diffusion process in refining samples.


PRISE: Learning Temporal Action Abstractions as a Sequence Compression Problem

Temporal action abstractions, along with belief state representations, are powerful knowledge sharing mechanisms for sequential decision making. In a recent paper, PRISE: Learning Temporal Action Abstractions as a Sequence Compression Problem, researchers from Microsoft and University of Maryland propose a novel connection between the seemingly distant realms of training large language models (LLMs) and inducing temporal action abstractions for continuous control domains such as robotics. The researchers introduce an approach called Primitive Sequence Encoding (PRISE) that combines continuous action quantization with a subtle but critical component of LLM training pipelines – input tokenization via byte pair encoding (BPE) – to learn powerful variable-timespan action abstractions. They empirically show that high-level skills discovered by PRISE from a multitask set of robotic manipulation demonstrations significantly boost the performance of both multitask imitation learning and few-shot imitation learning on unseen tasks.
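
As a rough, illustrative sketch of the idea (PRISE itself uses learned action quantization and a full BPE tokenizer, so the code below is only a conceptual stand-in), continuous actions can be discretized against a small codebook and frequent adjacent code pairs merged, BPE-style, into longer "skill" tokens:

    from collections import Counter

    def quantize(actions, codebook):
        """Map each continuous (scalar) action to the index of its nearest codebook entry."""
        return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - a)) for a in actions]

    def bpe_merge(seq, num_merges):
        """Greedy BPE: repeatedly merge the most frequent adjacent pair into a new token."""
        next_token = max(seq) + 1
        for _ in range(num_merges):
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    merged.append(next_token)  # new "skill" token spans two primitive actions
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            seq, next_token = merged, next_token + 1
        return seq

    codes = quantize([0.1, 0.12, 0.9, 0.11, 0.88, 0.9], codebook=[0.0, 0.5, 1.0])
    print(bpe_merge(codes, num_merges=2))  # a shorter sequence of variable-timespan tokens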

The post Research Focus: Week of March 4, 2024 appeared first on Microsoft Research.

]]>
Orca-Math: Demonstrating the potential of SLMs with model specialization https://www.microsoft.com/en-us/research/blog/orca-math-demonstrating-the-potential-of-slms-with-model-specialization/ Tue, 05 Mar 2024 14:00:00 +0000 https://www.microsoft.com/en-us/research/?p=1010406 Microsoft’s Orca-Math, a specialized small language model, outperforms much larger models in solving math problems that require multi-step reasoning and shows the potential of using feedback to improve language models. Learn more.

The post Orca-Math: Demonstrating the potential of SLMs with model specialization appeared first on Microsoft Research.

]]>

Our work on Orca and Orca 2 demonstrated the power of improved training signals and methods to enhance the reasoning abilities of smaller language models, getting closer to the levels found in much larger language models. Orca-Math is another step in this direction, where we explore the capabilities of small language models (SLMs) when specialized in a certain area, in this case solving grade school math problems, which has long been recognized as a complex task for SLMs.

Orca-Math is a 7-billion-parameter model created by fine-tuning the Mistral 7B model. Orca-Math achieves 86.81% pass@1 on GSM8K, exceeding the performance of much bigger models, including general models (e.g., LLAMA-2-70B, Gemini Pro, and GPT-3.5) and math-specific models (e.g., MetaMath-70B and WizardMath-70B). Note that the base model, Mistral-7B, achieves 37.83% on GSM8K.

Bar graph comparing the GSM8K scores of LLAMA-2-70B, GPT-3.5, Gemini Pro, WizardMath-70B, MetaMath-70B, and Orca-Math-7B, with an upward trend in quality: Orca-Math-7B outperforms the other, bigger models on GSM8K.

The state-of-the-art (SOTA) performance of the Orca-Math model can be attributed to two key insights:

  • Training on a high-quality synthetic dataset of 200,000 math problems, created using a multi-agent setup (AutoGen). This is far smaller than other math datasets, which can contain millions of problems. The smaller model and smaller dataset mean faster and cheaper training.
  • In addition to traditional supervised fine-tuning, the model was trained with an iterative learning process, in which it practices solving problems and continues to improve based on feedback from a teacher.

Our findings show that smaller models are valuable in specialized settings, where they can match the performance of much larger models. They also highlight the potential of continual learning and of using feedback to improve language models. We are making the dataset (opens in new tab) publicly available, along with a report (opens in new tab) describing the training procedure, to encourage research on the improvement and specialization of smaller language models.

Teaching SLMs math

Solving mathematical word problems has long been recognized as a complex task for SLMs. Models that achieve over 80% accuracy on the GSM8K benchmark (Grade School Math 8K, a dataset of 8,500 high-quality grade school math word problems that require multi-step reasoning) typically exceed 30 billion parameters.

To reach higher levels of performance with smaller models, researchers often train SLMs to generate code, or use calculators to help avoid calculation errors. Additionally, they employ a technique called ensembling, in which the model is called up to 100 times, with each call reattempting to solve the problem. Ensembling provides a substantial boost in accuracy, but at a significant increase in compute cost due to the multiple calls to the model.
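For concreteness, the sketch below shows one common form of ensembling, majority voting over repeated samples. The `generate_answer` callable is a hypothetical stand-in for a single model call, and nothing here is specific to Orca-Math, which deliberately avoids this kind of test-time ensembling.

```python
import random
from collections import Counter
from typing import Callable

def ensemble_solve(problem: str,
                   generate_answer: Callable[[str], str],
                   k: int = 100) -> str:
    """Call the model k times and return the majority-vote final answer.

    generate_answer is a hypothetical function that samples one solution from
    the model (e.g., with temperature > 0) and returns its final numeric answer."""
    votes = Counter(generate_answer(problem) for _ in range(k))
    answer, _count = votes.most_common(1)[0]
    return answer

# Example with a dummy "model" that guesses among a few candidate answers.
dummy = lambda p: random.choice(["42", "42", "41", "43"])
print(ensemble_solve("What is 6 x 7?", dummy, k=25))   # usually "42"
```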

This research aims to explore how far we can push the native ability of smaller language models when they are specialized to solve math problems, without the use of external tools, verifiers or ensembling. More specifically, we focus on two directions:

AgentInstruct

Previous work on synthetic data creation often uses frontier models to generate similar problems based on a seed problem. Providing paraphrases of the seed with different numbers and attributes can be useful for creating training data for the smaller model. We propose employing multi-agent flows, using AutoGen, to create new problems and solutions; this not only yields more demonstrations per problem but also increases the diversity and difficulty range of the problems.

To generate more challenging problems, we create a setup with a team of agents working collaboratively to create a dataset geared toward a predefined objective. For example, we can use two agents, namely Suggester and Editor. The Suggester examines a problem and proposes several methods for increasing its complexity, while the Editor takes the original word problem and the Suggester’s recommendations to generate an updated, more challenging problem. This iterative process can occur over multiple rounds, with each round further increasing the complexity of the previously generated problem. A third agent can then verify that the problem is solvable and create the solution.
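A bare-bones version of this Suggester and Editor flow might look like the sketch below. It deliberately avoids any particular agent framework: `call_llm` is a hypothetical function that sends a prompt to a chat model and returns text, standing in for the AutoGen agents used in the actual setup, and the prompt wording is purely illustrative.

```python
from typing import Callable

def agentinstruct_round(seed_problem: str,
                        call_llm: Callable[[str], str],
                        rounds: int = 3) -> dict:
    """Iteratively harden a seed word problem with Suggester/Editor roles,
    then have a third role verify it and write a solution (illustrative only)."""
    problem = seed_problem
    for _ in range(rounds):
        suggestions = call_llm(
            "You are the Suggester. Propose several ways to make this math word "
            f"problem more challenging without making it unsolvable:\n{problem}"
        )
        problem = call_llm(
            "You are the Editor. Rewrite the problem by applying these suggestions. "
            f"Return only the new problem.\nProblem:\n{problem}\nSuggestions:\n{suggestions}"
        )
    solution = call_llm(
        "You are the Verifier. Check that this problem is solvable and give a "
        f"step-by-step solution with a final numeric answer:\n{problem}"
    )
    return {"problem": problem, "solution": solution}
```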

Iterative learning

Using high-quality training data that elicits richer learning signals (e.g., explanations) has been shown to significantly improve SLMs' ability to acquire skills that previously emerged only at much larger scale.

This paradigm fits under a teacher-student approach, where the large model (the teacher) creates demonstrations for the SLM (the student) to learn from. In this work, we extend the teacher-student paradigm to iterative learning settings as follows:

  • Teaching by demonstration: In this stage, we train the SLM by using AgentInstruct to demonstrate problems and their solutions.
  • Practice and feedback: We let the SLM practice solving problems on its own. For every problem, we allow the SLM to create multiple solutions. We then use the teacher model to provide feedback on the SLM solutions. If the SLM is unable to correctly solve the problem, even after multiple attempts, we use a solution provided by the teacher.
  • Iterative improvement: We use the teacher feedback to create preference data showing the SLM both good and bad solutions to the same problem, and then retrain the SLM.

The practice, feedback, and iterative improvement steps can be repeated multiple times.
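Taken together, the three stages amount to a simple outer loop, sketched schematically below. Every callable here (`sft`, `sample_solutions`, `teacher_grade`, `teacher_solve`, `preference_update`) is a hypothetical placeholder for supervised fine-tuning, sampling, teacher feedback, and preference-based training; the actual Orca-Math pipeline is described in the report linked above.

```python
from typing import Callable, Iterable, List, Tuple

def iterative_learning(
    student,
    demos: list,
    problems: Iterable[Tuple[str, str]],   # (problem, reference answer) pairs
    sft: Callable,                         # supervised fine-tuning step
    sample_solutions: Callable,            # (student, problem, k) -> list of attempts
    teacher_grade: Callable,               # (problem, attempt, reference) -> bool
    teacher_solve: Callable,               # (problem) -> teacher solution
    preference_update: Callable,           # (student, preference pairs) -> student
    n_iters: int = 3,
    k: int = 4,
):
    """Schematic teach -> practice -> feedback -> improve loop (sketch only)."""
    student = sft(student, demos)                          # 1. teaching by demonstration
    for _ in range(n_iters):
        pairs: List[Tuple[str, str, str]] = []             # (problem, preferred, rejected)
        for problem, reference in problems:
            attempts = sample_solutions(student, problem, k)           # 2. practice
            graded = [(a, teacher_grade(problem, a, reference)) for a in attempts]
            good = [a for a, ok in graded if ok]
            bad = [a for a, ok in graded if not ok]
            if not good:                                   # fall back to a teacher solution
                good = [teacher_solve(problem)]
            if bad:
                pairs.append((problem, good[0], bad[0]))
        student = preference_update(student, pairs)        # 3. iterative improvement
    return student
```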

Conclusion

Our findings show that smaller models are valuable in specialized settings where they can match the performance of much larger models but with a limited scope. By training Orca-Math on a small dataset of 200,000 math problems, we have achieved performance levels that rival or surpass those of much larger models.

The relatively small size of the dataset also shows the potential of using multi-agent flows to simulate the process of data and feedback generation. That size has implications for the cost of training and highlights that training data with richer learning signals can improve the efficiency of the learning process. Our findings also highlight the potential of continual learning, where a language model iteratively improves as it receives more feedback from a person or another model.

The post Orca-Math: Demonstrating the potential of SLMs with model specialization appeared first on Microsoft Research.

]]>
ViSNet: A general molecular geometry modeling framework for predicting molecular properties and simulating molecular dynamics https://www.microsoft.com/en-us/research/blog/visnet-a-general-molecular-geometry-modeling-framework-for-predicting-molecular-properties-and-simulating-molecular-dynamics/ Thu, 29 Feb 2024 20:00:00 +0000 https://www.microsoft.com/en-us/research/?p=1002609 Molecular geometry modeling is a powerful tool for understanding the intricate relationships between molecular structure and biological activity – a field known as structure-activity relationships (SAR). The main premise of SAR is that the biological activity of a molecule is dictated by its specific chemical structure, not only the connections between nuclei but also how […]

The post ViSNet: A general molecular geometry modeling framework for predicting molecular properties and simulating molecular dynamics appeared first on Microsoft Research.

]]>

Molecular geometry modeling is a powerful tool for understanding the intricate relationships between molecular structure and biological activity – a field known as structure-activity relationships (SAR). The main premise of SAR is that the biological activity of a molecule is dictated by its specific chemical structure, not only the connections between nuclei but also how the molecule is twisted and arranged in a three-dimensional configuration. The holy grail in SAR is to be able to predict how molecular configurations influence vital processes such as drug interactions, chemical reactivity, and protein functionality. If this were possible, scientists could predict the efficacy of a drug, as well as its side effects and toxicity, long before it is ever tested on people.

The vector-scalar interactive graph neural network (ViSNet) framework, developed by Microsoft, is a novel approach to molecular geometry modeling. ViSNet is designed to help researchers predict molecular properties, simulate molecular dynamics, and gain a more precise understanding of structure-activity relationships. As a result, ViSNet has the potential to help transform drug discovery, materials science, and other critical fields.

Our research aims to improve the interpretability of molecular data, reduce computing costs, and evaluate real-world application utility. “Enhancing geometric representations for molecules with equivariant vector-scalar interactive message passing” was published in Nature Communications (opens in new tab) in January 2024 and selected for “Editors’ Highlights” in both the “AI and machine learning (opens in new tab)” and “biotechnology and method (opens in new tab)” categories.

Geometry deep learning and SAR

Geometry deep learning is a method at the forefront of SAR investigations: a powerful computational approach that harnesses the power of deep-learning techniques to analyze and understand the three-dimensional structures of molecules. Traditional deep-learning methods primarily focus on processing data organized in grid-like structures, such as images or sequences of text. However, molecules are inherently three-dimensional entities with complex geometries, making them challenging to analyze using conventional deep-learning approaches. Geometry deep learning addresses this challenge by building specialized architectures and algorithms capable of handling three-dimensional data. These methods enable computers to learn and extract meaningful features from the spatial arrangement of atoms within molecules, capturing crucial information about their structure and behavior. 

Despite significant recent strides in geometry deep learning, however, challenges persist. These include:  

  • Insufficient molecular interpretability – We are limited in our ability to understand and interpret the inner workings of deep neural networks when applied to molecular geometry modeling. While these networks excel at making predictions based on large datasets and complex patterns, they often operate as “black boxes,” meaning the rationale behind their predictions isn’t always understandable or transparent. In the context of molecular geometry, this lack of interpretability poses challenges in comprehending why certain molecular structures lead to specific outcomes, such as biological activity or chemical reactivity. 
  • Rapidly increasing computing costs as molecular size increases – As molecules increase in size and complexity, the computational resources required to analyze them escalate dramatically. This challenge becomes particularly pronounced when employing advanced computational techniques, such as those using high-order Clebsch–Gordan coefficients. The Clebsch–Gordan coefficients are mathematical quantities used in quantum mechanics to describe the coupling of the angular momentum properties of particles. In the context of molecular modeling, these coefficients are employed in sophisticated quantum mechanical calculations to help account for the interactions between electrons and nuclei within a molecule. For large molecules, the number of atoms and electrons involved increases exponentially, resulting in an astronomical number of possible interactions that must be considered. As a result, the calculations involving high-order Clebsch–Gordan coefficients become tremendously complex and computationally demanding. 
  • Need for blind tests and evaluations in real applications – Assessing predictive models in real-world applications through blind tests is crucial for evaluating their reliability and applicability beyond controlled benchmarks. However, challenges arise due to the scarcity of diverse and representative datasets, and complex system dynamics. There are also ethical considerations in animal and human trials, which naturally restrict the availability of such data. Overcoming these challenges requires interdisciplinary collaboration, innovative methodologies, and transparent validation frameworks to ensure the robustness and trustworthiness of predictive models in addressing real-world challenges. 

Enhancing molecular geometry representations by ViSNet 

Originally, our goal was to develop a model capable of effectively harnessing the intricate structures of molecules. Traditional molecular dynamics (MD) simulations track molecular movements by considering factors like bond length, bond angle, and dihedral angles. Taking inspiration from these methods, we introduced a novel approach called the vector-scalar interactive graph neural network (ViSNet).

Rather than integrating bond angle and dihedral information into our model directly, we introduced a concept termed "direction units." These units represent nodes within the molecular structure as vectors, calculated by summing the normalized vectors pointing from the central node to its neighboring nodes. We expanded traditional calculations of bond length, bond angle, and dihedral angles into interactions involving pairs of atoms (two-body), triplets of atoms (three-body), and quadruplets of atoms (four-body). To efficiently manage these interactions, we devised a runtime geometry calculation (RGC) module, which accurately captures the complex relationships between atoms in a molecule. Remarkably, the RGC module's computations for three-body and four-body interactions exhibit linear time complexity, ensuring computational efficiency.
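To make direction units concrete, the snippet below computes them for a toy molecule in plain PyTorch: each atom's direction unit is the sum of unit vectors pointing toward its neighbors, and a simple dot product of direction units across each edge illustrates the kind of angular, three-body-style scalar that can be extracted in linear time over the edges. This is an illustrative sketch of the geometric quantities described above, not the ViSNet code.

```python
import torch

def direction_units(pos: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
    """pos: (N, 3) atom coordinates; edge_index: (2, E) directed edges (src -> dst).
    Returns an (N, 3) tensor: for each atom, the sum of unit vectors to its neighbors."""
    src, dst = edge_index
    rel = pos[dst] - pos[src]                         # vector from an atom to its neighbor
    rel = rel / (rel.norm(dim=-1, keepdim=True) + 1e-9)
    units = torch.zeros_like(pos)
    units.index_add_(0, src, rel)                     # accumulate per source atom
    return units

# Toy example: 4 atoms in a bent chain, with edges in both directions.
pos = torch.tensor([[0., 0., 0.], [1., 0., 0.], [2., 0., 0.], [2., 1., 0.]])
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])
u = direction_units(pos, edge_index)

# A simple angular (three-body-style) scalar per edge: the dot product of the
# two endpoints' direction units, computed in linear time over the edges.
src, dst = edge_index
angular = (u[src] * u[dst]).sum(-1)
print(u.shape, angular.shape)   # torch.Size([4, 3]) torch.Size([6])
```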

Additionally, we introduced a mechanism known as vector-scalar interactive message passing (ViS-MP), facilitating the exchange of information between nodes and edges in the molecular graph. This mechanism iteratively updates the direction units of nodes based on scalar representations of nodes and edges, and vice versa, through the RGC module. These distinctive features of the RGC and ViS-MP significantly enhance our model’s capacity to encode molecular geometry and streamline the process of information exchange within the molecular graph neural network.

Figure 1. The general model architecture of ViSNet. (a) Model sketch of ViSNet: the model embeds the 3D structures of molecules, extracts geometric information through a series of ViSNet blocks, and outputs molecular properties such as energy, forces, and the HOMO-LUMO gap through an output block. (b) Flowchart of one ViSNet block, which consists of two modules: i) Scalar2Vec, responsible for attaching scalar embeddings to vectors, and ii) Vec2Scalar. The inputs of Scalar2Vec are the node embedding, edge embedding, direction unit, and the relative positions between two atoms.

ViSNet in real-world applications for molecular modeling and property predictions

To gauge ViSNet’s practical utility, we rigorously evaluated its performance using established benchmarks for predicting molecular properties. Across a range of datasets, including MD17, revised MD17, MD22, QM9, and Molecule3D, ViSNet consistently outperformed existing algorithms, showcasing its exceptional accuracy in representing molecular geometry.

We then put ViSNet to the test by simulating the behavior of the Chignolin protein through molecular dynamics (MD) simulations. Trained on the AIMD-Chig dataset, featuring protein data calculated using advanced density functional theory (DFT) methods, ViSNet outperformed traditional empirical force fields and showed promise when compared to contemporary machine-learning force fields. Notably, simulations with ViSNet closely mirrored outcomes from rigorous DFT calculations, highlighting its potential for precise and efficient data simulations.

We used ViSNet to participate in the First Global AI Drug Development Competition (opens in new tab), an international competition to predict inhibitors of the SARS-CoV-2 main protease given the sequence information (i.e., SMILES) of small molecules. Worldwide, 1,105 participants from 878 teams took part. ViSNet helped us win the competition, demonstrating its strong prediction accuracy.

Figure 2. ViSNet in the PyTorch Geometric Library: a PyTorch module implementing the equivariant vector-scalar interactive graph neural network (ViSNet) from the paper "Enhancing Geometric Representations for Molecules with Equivariant Vector-Scalar Interactive Message Passing."

To make ViSNet more accessible and user-friendly, Microsoft has integrated it into the PyTorch Geometric Library (opens in new tab) as a core model for molecular modeling and property prediction. This integration aims to broaden the scope of applications and simplify the usage of ViSNet for researchers and practitioners. Additionally, to ensure ongoing support and improvement, a regularly updated version of ViSNet is now available on GitHub (opens in new tab), providing users with the latest enhancements.
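Because ViSNet ships as a core model in PyTorch Geometric, basic usage can be as short as the sketch below. The constructor arguments and the (energy, forces) return convention reflect a reading of recent PyTorch Geometric releases and may differ between versions, so treat this as a usage sketch rather than canonical API documentation.

```python
import torch
from torch_geometric.nn.models import ViSNet

# Assumed interface (check your installed PyG version): the model takes atomic
# numbers z, 3D positions pos, and a batch vector, and returns per-molecule
# energies plus, when derivative=True, per-atom forces as -dE/dpos.
model = ViSNet(hidden_channels=128, num_layers=6, cutoff=5.0, derivative=True)

# Toy batch: a single water-like molecule (O, H, H) with rough coordinates in angstroms.
z = torch.tensor([8, 1, 1])
pos = torch.tensor([[0.00, 0.00, 0.00],
                    [0.96, 0.00, 0.00],
                    [-0.24, 0.93, 0.00]])
batch = torch.zeros(3, dtype=torch.long)

energy, forces = model(z, pos, batch)
print(energy.shape, forces.shape)   # per-molecule energy, per-atom forces
```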

Recognizing the potential limitations of graph neural networks, such as the risk of “over-smoothing” (i.e., making nodes indistinguishable from one another) as models grow larger and more complex, we developed a Transformer-based version of ViSNet known as Geoformer (short for Geometric Transformer). This novel variant, introduced in our publication at NeurIPS 2023 (opens in new tab), addresses scalability challenges by transferring the key components of ViSNet into the Transformer architecture. This includes incorporating the RGC module into the Transformer attention mechanism and introducing a new method called interatomic positional encoding (IPE) to capture spatial relationships between atoms.

Figure 3. The overall pipeline of AI2BMD (see demos at https://microsoft.github.io/AI2BMD/index.html (opens in new tab)). Proteins are divided into protein units by a fragmentation process. The AI2BMD potential is designed based on ViSNet, with datasets generated at the DFT level, and it calculates the energy and atomic forces for the whole protein. The AI2BMD simulation system is built upon these components and provides a generalizable way to perform simulations for various proteins with ab initio accuracy in energy and force calculations. Comprehensive kinetic and thermodynamic analysis shows that AI2BMD aligns well with wet-lab experimental data and detects phenomena that differ from molecular mechanics.

Looking forward: Toward AI-powered MD simulations with ab initio accuracy

As a crucial component of the AI-powered Ab Initio Molecular Dynamics (AI2BMD) project (opens in new tab), ViSNet plays a pivotal role in accelerating molecular dynamics simulations. The project’s primary objective is to enhance the accuracy and efficiency of these simulations, with the aim of achieving results comparable to those obtained through rigorous ab initio methods, even for large molecular systems. 

By integrating ViSNet into AI2BMD, significant strides have been made toward achieving this goal. ViSNet enables AI2BMD to achieve levels of accuracy in energy and force calculations that closely approach those of ab initio methods, even for complex proteins containing over 10,000 atoms. By leveraging ViSNet in protein dynamics simulations, AI2BMD aims to enhance the precision of free energy estimations and provide valuable insights into protein folding thermodynamics. 

ViSNet’s contributions extend beyond energy calculations to the characterization of various protein properties. These insights have the potential to complement experimental research efforts by offering predictive capabilities and guiding further investigations into protein structure and function. The advancements in molecular geometry modeling, demonstrated by the innovative ViSNet framework, portend a new era of precision and efficiency in computational chemistry and biophysics.  

Through meticulous design and rigorous validation, ViSNet has emerged as a versatile tool capable of giving insight into the intricate relationships between molecular structure and biological activity – getting us one step closer to the holy grail of structure-activity relationships. The integration of ViSNet into established libraries and frameworks, coupled with ongoing research efforts to enhance scalability and accuracy, underscores its potential to revolutionize drug discovery, materials science, and more.

The post ViSNet: A general molecular geometry modeling framework for predicting molecular properties and simulating molecular dynamics appeared first on Microsoft Research.

]]>
Abstracts: February 29, 2024 https://www.microsoft.com/en-us/research/podcast/abstracts-february-29-2024/ Thu, 29 Feb 2024 14:00:00 +0000 https://www.microsoft.com/en-us/research/?p=1009941 Can how we think about our thinking help us better incorporate generative AI in our lives & work? Explore metacognition’s potential to improve the tech’s usability on “Abstracts,” then sign up for Microsoft Research Forum for more on this & other AI work.

The post Abstracts: February 29, 2024 appeared first on Microsoft Research.

]]>

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Behavioral Science Researcher Lev Tankelevitch joins host Gretchen Huizinga to discuss “The Metacognitive Demands and Opportunities of Generative AI.” In their paper, Tankelevitch and his coauthors propose using the scientific study of how people monitor, understand, and adapt their thinking to address common challenges of incorporating generative AI into life and work—from crafting effective prompts to determining the value of AI-generated outputs.  

To learn more about the paper and related topics, register for Microsoft Research Forum (opens in new tab), a series of panel discussions and lightning talks around science and technology research in the era of general AI.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.  

[MUSIC FADES] 

Today, I’m talking to Dr. Lev Tankelevitch, a senior behavioral science researcher from Microsoft Research. Dr. Tankelevitch is coauthor of a paper called “The Metacognitive Demands and Opportunities of Generative AI,” and you can read this paper now on arXiv. Lev, thanks for joining us on Abstracts.

LEV TANKELEVITCH: Thanks for having me. 

HUIZINGA: So in just a couple sentences—a metacognitive elevator pitch, if you will—tell us about the issue or problem your paper addresses and, more importantly, why we should care about it. 

TANKELEVITCH: Sure. So as generative AI has, sort of, rolled out over the last year or two, we’ve seen some user studies come out, and as we read these studies, we noticed there are a lot of challenges that people face with these tools. So people really struggle with, you know, writing prompts for systems like Copilot or ChatGPT. For example, they don’t even know really where to start, or they don’t know how to convert an idea they have in their head into, like, clear instructions for these systems. If they’re, sort of, working in a field that maybe they’re less familiar with, like a new programming language, and they get an output from these systems, they’re not really sure if it’s right or not. And then, sort of, more broadly, they don’t really know how to fit these systems into their workflows. And so we’ve noticed all these challenges, sort of, arise, and some of them relate to, sort of, the unique features of generative AI, and some relate to the design of these systems. But basically, we started to, sort of, look at these challenges, and try to understand what’s going on—how can we make sense of them in a more coherent way and actually build systems that really augment people and their capabilities rather than, sort of, posing these challenges? 

HUIZINGA: Right. So let’s talk a little bit about the related research that you’re building on here and what unique insights or directions your paper adds to the literature. 

TANKELEVITCH: So as I mentioned, we were reading all these different user studies that were, sort of, testing different prototypes or existing systems like ChatGPT or GitHub Copilot, and we noticed different patterns emerging, and we noticed that the same kinds of challenges were cropping up. But there weren’t any, sort of, clear coherent explanations that tied all these things together. And in general, I’d say that human-computer interaction research, which is where a lot of these papers are coming out from, it’s really about building prototypes, testing them quickly, exploring things in an open-ended way. And so we thought that there was an opportunity to step back and to try to see how we can understand these patterns from a more theory-driven perspective. And so, with that in mind, one perspective that became clearly relevant to this problem is that of metacognition, which is this idea of “thinking about thinking” or how we, sort of, monitor our cognition or our thinking and then control our cognition and thinking. And so we thought there was really an opportunity here to take this set of theories and research findings from psychology and cognitive science on metacognition and see how they can apply to understanding these usability challenges of generative AI systems. 

HUIZINGA: Yeah. Well, this paper isn’t a traditional report on empirical research as many of the papers on this podcast are. So how would you characterize the approach you chose and why?

TANKELEVITCH: So the way that we got into this, working on this project, it was, it was quite organic. So we were looking at these user studies, and we noticed these challenges emerging, and we really tried to figure out how we can make sense of them. And so it occurred to us that metacognition is really quite relevant. And so what we did was we then dove into the metacognition research from psychology and cognitive science to really understand what are the latest theories, what are the latest research findings, how could we understand what’s known about that from that perspective, from that, sort of, fundamental research, and then go back to the user studies that we saw in human-computer interaction and see how those ideas can apply there. And so we did this, sort of, in an iterative way until we realized that we really have something to work with here. We can really apply a somewhat coherent framework onto these, sort of, disparate set of findings not only to understand these usability challenges but then also to actually propose directions for new design and research explorations to build better systems that support people’s metacognition. 

HUIZINGA: So, Lev, given the purpose of your paper, what are the major takeaways for your readers, and how did you present them in the paper? 

TANKELEVITCH: So I think the key, sort of, fundamental point is that the perspective of metacognition is really valuable for understanding the usability challenges of generative AI and potentially designing new systems that support metacognition. And so one analogy that we thought was really useful here is of a manager delegating tasks to a team. And so a manager has to determine, you know, what is their goal in their work? What are the different subgoals that that goal breaks down into? How can you communicate those goals clearly to a team, right? Then how do you assess your team’s outputs? And then how do you actually adjust your strategy accordingly as the team works in an iterative fashion? And then at a higher level, you have to really know how to—actually what to delegate to your team and how you might want to delegate that. And so we realized that working with generative AI really parallels these different aspects of what a manager does, right. So when people have to write a prompt initially, they really have to have self-awareness of their task goals. What are you actually trying to achieve? How does that translate into different subtasks? And how do you verbalize that to a system in a way that system understands? You might then get an output and you need to iterate on that output. So then you need to really think about, what is your level of confidence in your prompting ability? So is your prompting the main reason why the output isn’t maybe as satisfactory as you want, or is it something to do with the system? Then you actually might get the output [you’re] happy with, but you’re not really sure if you should fully rely on it because maybe it’s an area that is outside of your domain of expertise. And so then you need to maintain an appropriate level of confidence, right? Either to verify that output further or decide not to rely on it, for example. And then at a, sort of, broader level, this is about the question of task delegation. So this requires having self-awareness of the applicability of generative AI to your workflows and maintaining an appropriate level of confidence in completing tasks manually or relying on generative AI. For example, whether it’s worth it for you to actually learn how to work with generative AI more effectively. And then finally, it requires, sort of, metacognitive flexibility to adapt your workflows as you work with these tools. So are there some tasks where the way that you’re working with them is, sort of, slowing you down in specific ways? So being able to recognize that and then change your strategies as necessary really requires metacognitive flexibility. So that was, sort of, one key half of our findings.  

And then beyond that we really thought about how we can use this perspective of metacognition to design better systems. And so one, sort of, general direction is really about supporting people’s metacognition. So we know from research from cognitive science and psychology that we can actually design interventions to improve people’s metacognition in a lasting and effective way. And so similarly, we can design systems that support people’s metacognition. For example, systems that support people in planning their tasks as they actually craft prompts. We can support people in actually reflecting on their confidence in their prompting ability or in assessing the output that they see. And so this relates a little bit to AI acting as a coach for you, which is an idea that the Microsoft Research New York City team came up with. So this is Jake Hofman, David Rothschild, and Dan Goldstein. And so, in this way, generative AI systems can really help you reflect as a coach and understand whether you have the right level of confidence in assessing output or crafting prompts and so on. And then similarly, at a higher level, they can help you manage your workflows, so helping you reflect on whether generative AI is really working for you in certain tasks or whether you can adapt your strategy in certain ways. And likewise, this relates also to explanations about AI, so how you can actually design systems that are explainable to users in a way that helps them achieve their goals? And explainability can be thought about as a way to actually reduce the metacognitive demand because you’re, sort of, explaining things in a way to people that they don’t have to keep in their mind and have to think about, and that, sort of, improves their confidence. It can help them improve their confidence or calibrate their confidence in their ability to assess outputs. 

HUIZINGA: Talk for a minute about real-world impact of this research. And by that, I mean, who does it help most and how? Who’s your main audience for this right now?

TANKELEVITCH: In a sense, this is very broadly applicable. It’s really about designing systems that people can interact with in any domain and in any context. But I think, given how generative AI has rolled out in the world today, I mean, a lot of the focus has been on productivity and workflows. And so this is a really well-defined, clear area where there is an opportunity to actually help people achieve more and stay in control and actually be more intentional and be more aligned with their goals. And so this is, this is an approach where not only can we go beyond, sort of, automating specific tasks but actually use these systems to help people clarify their goals and track with them in a more effective way. And so knowledge workers are an obvious, sort of, use case or an obvious area where this is really relevant because they work in a complex system where a lot of the work is, sort of, diffused and spread across collaborations and artifacts and softwares and different ways of working. And so a lot of things are, sort of, lost or made difficult by that complexity. And so systems, um, that are flexible and help people actually reflect on what they want to achieve can really have a big impact here. 

HUIZINGA: Mm-hmm. Are you a little bit upstream of that even now in the sense that this is a “research direction” kind of paper. I noticed that as I read it, I felt like this was how researchers can begin to think about what they’re doing and how that will help downstream from that. 

TANKELEVITCH: Yes. That’s exactly right. So this is really about, we hope, unlocking a new direction of research and design where we take this perspective of metacognition—of how we can help people think more clearly and, sort of, monitor and control their own cognition—and design systems to help them do that. And in the paper, there’s a whole list of different questions, both fundamental research questions to understand in more depth how metacognition plays a role in human-AI interaction when people work with generative AI systems but also how we can then actually design new interventions or new systems that actually support people’s metacognition. And so there’s a lot of work to do in this, and we hope that, sort of, inspires a lot of further research, and we’re certainly planning to do a lot more follow-up research. 

HUIZINGA: Yeah. So I always ask, if there was just one thing that you wanted our listeners to take away from this work, a sort of golden nugget, what would it be? 

TANKELEVITCH: I mean, I’d say that if we really want generative AI to be about augmenting human agency, then I think we need to focus on understanding how people think and behave in their real-world context and design for that. And so I think specifically, the real potential of generative AI here, as I was saying, is not just to automate a bunch of tasks but really to help people clarify their intentions and goals and act in line with them. And so, in a way, it’s kind of about building tools for thought, which was the real vision of the early pioneers of computing. And so I hope that this, kind of, goes back to that original idea.

HUIZINGA: You mentioned this short list of open research questions in the field, along with a list of suggested interventions. You’ve, sort of, curated that for your readers at the end of the paper. But give our audience a little overview of that and how those questions inform your own research agenda coming up next. 

TANKELEVITCH: Sure. So on the, sort of, fundamental research side of things, there are a lot of questions around how, for example, self-confidence that people have plays a role in their interactions with generative AI systems. So this could be self-confidence in their ability to prompt these systems. And so that is one interesting research question. What is the role of confidence and calibrating one’s confidence in prompting? And then similarly, on the, sort of, output evaluation side, when you get an output from generative AI, how do you calibrate your confidence in assessing that output, right, especially if it’s in an area where maybe you’re less familiar with? And so there’s these interesting, nuanced questions around self-confidence that are really interesting, and we’re actually exploring this in a new study. This is part of the AI, Cognition, and [the] Economy pilot project. So this is a collaboration that we’re running with Dr. Clara Colombatto, who’s a researcher in University of Waterloo and University College London, and we’re essentially designing a study where we’re trying to understand people’s confidence in themselves, in their planning ability, and in working with AI systems to do planning together, and how that influences their reliance on the output of generative AI systems. 

[MUSIC PLAYS] 

HUIZINGA: Well, Lev Tankelevitch, thank you for joining us today, and to our listeners, thanks for tuning in. If you want to read the full paper on metacognition and generative AI, you can find a link at aka.ms/abstracts, or you can read it on arXiv. Also, Lev will be speaking about this work at the upcoming Microsoft Research Forum, and you can register for this series of events at researchforum.microsoft.com. See you next time on Abstracts.

[MUSIC FADES]

The post Abstracts: February 29, 2024 appeared first on Microsoft Research.

]]>
Structured knowledge from LLMs improves prompt learning for visual language models https://www.microsoft.com/en-us/research/blog/structured-knowledge-from-llms-improves-prompt-learning-for-visual-language-models/ Tue, 27 Feb 2024 17:00:00 +0000 https://www.microsoft.com/en-us/research/?p=1009260 Using LLMs to create structured graphs of image descriptors can enhance the images generated by visual language models. Learn how structured knowledge can improve prompt tuning for both visual and language comprehension.

The post Structured knowledge from LLMs improves prompt learning for visual language models appeared first on Microsoft Research.

]]>
This research paper was presented at the 38th Annual AAAI Conference on Artificial Intelligence (opens in new tab) (AAAI-24), the premier forum for advancing understanding of intelligence and its implementation in machines.


We’re seeing remarkable abilities from visual language models in transforming text descriptions into images. However, creating high-quality visuals requires crafting precise prompts that capture the relationships among the different image elements, a capability that standard prompts lack. In our paper, “Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models,” presented at AAAI-24, we introduce a novel approach that uses large language models (LLMs) to enhance the images created by visual language models. By creating detailed graphs of image descriptions, we leverage LLMs’ linguistic knowledge to produce richer images, expanding their utility in practical applications.

Figure 1. A structured graph provides descriptions for each class name. The example shows three types of prompts used to recognize a bird: a templated prompt (“a photo of a bird”), a natural-language prompt that describes the bird category, and a tree-structured prompt highlighting the key entities of birds and their corresponding attributes, such as beak and wings.

Figure 1 illustrates our method for constructing a structured graph containing key details for each category, or class. These graphs contain structured information, with entities (objects, people, and concepts), attributes (characteristics), and the relationships between them. For example, when defining “water lily,” we include entities like “leaves” and “blooms” along with their attributes, such as “round” and “white,” and then apply LLMs’ reasoning capabilities to identify how these terms relate to each other. This is shown in Figure 2.

Figure 2. With instructions fed into the LLM, we can receive category-related descriptions along with corresponding structured graphs. The LLM is first instructed to give a category description and is then asked to parse the key entities, attributes, and their relationships from the unstructured description.
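In code, the structured description the LLM is asked to produce can be as simple as a small dictionary of entities, attributes, and relationship triples. The sketch below shows hypothetical instructions and the kind of graph they might yield for the “water lily” example; the exact prompt wording, the JSON schema, and the extra attributes shown here are illustrative assumptions, not the prompts used in the paper.

```python
import json

CLASS_NAME = "water lily"

# Hypothetical two-step instruction: describe the class, then parse into a graph.
describe_prompt = f"Describe what a {CLASS_NAME} looks like in two or three sentences."
parse_prompt = (
    "From the description below, extract a JSON object with keys 'entities', "
    "'attributes' (mapping each entity to a list of attributes), and 'relationships' "
    "(a list of [head, relation, tail] triples).\n\nDescription: {description}"
)

# The kind of structured graph we would expect back for "water lily".
example_graph = {
    "entities": ["water lily", "leaves", "blooms"],
    "attributes": {"leaves": ["round", "floating"], "blooms": ["white", "fragrant"]},
    "relationships": [
        ["water lily", "has", "leaves"],
        ["water lily", "has", "blooms"],
    ],
}
print(json.dumps(example_graph, indent=2))
```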

How to model structural knowledge

After identifying and structuring the relationships within the generated prompt descriptions, we implement Hierarchical Prompt Tuning (HPT), a new prompt-tuning framework that organizes content hierarchically. This approach allows the visual language model to discern the different levels of information in a prompt, ranging from specific details to broader categories and overarching themes across multiple knowledge domains, as shown in Figure 3. This facilitates the model’s understanding of the connections among these elements, improving its ability to process complex queries across various topics.

Figure 3. HPT is based on a dual-path asymmetric network, which receives images and various types of text inputs. Descriptions and relationship-guided graphs with class names are used as input for the frozen text encoder and the hierarchical prompted text encoder, respectively.

Central to this method is a state-of-the-art relationship-guided attention module, designed to help the model identify and analyze the complex interconnections among elements within a graph. This module also understands the interactions between different entities and attributes through a cross-level self-attention mechanism. Self-attention enables the model to assess and prioritize various parts of the input data—here, the graph—according to their relevance. “Cross-level” self-attention extends this capability across various semantic layers within the graph, allowing the model to examine relationships at multiple levels of abstraction. This feature helps the model to discern the interrelations of prompts (or input commands/questions) across these various levels, helping it gain a deeper understanding of the categories or concepts.
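One simple way to picture a relationship-guided attention module is as self-attention whose scores are biased by the graph: token pairs connected by a relationship edge receive an additive boost before the softmax. The sketch below implements that generic idea in PyTorch; it illustrates the concept only, and the single-head formulation and additive bias scheme are assumptions rather than the HPT module itself.

```python
import torch
import torch.nn.functional as F

def relationship_guided_attention(x: torch.Tensor,
                                  rel_mask: torch.Tensor,
                                  w_q: torch.Tensor, w_k: torch.Tensor, w_v: torch.Tensor,
                                  rel_bias: float = 1.0) -> torch.Tensor:
    """x: (L, d) token embeddings (e.g., prompt tokens for entities and attributes).
    rel_mask: (L, L) binary matrix, 1 where a relationship edge connects two tokens.
    Returns the attention output of shape (L, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (q.size(-1) ** 0.5)
    scores = scores + rel_bias * rel_mask          # boost attention along graph edges
    return F.softmax(scores, dim=-1) @ v

# Toy example: 4 tokens, relationship edges (0-1) and (0-2), embedding size 8.
d, L = 8, 4
torch.manual_seed(0)
x = torch.randn(L, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
rel_mask = torch.zeros(L, L)
rel_mask[0, 1] = rel_mask[1, 0] = rel_mask[0, 2] = rel_mask[2, 0] = 1.0
out = relationship_guided_attention(x, rel_mask, w_q, w_k, w_v)
print(out.shape)   # torch.Size([4, 8])
```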

Our findings offer valuable insights into a more effective approach to navigating and understanding complex linguistic data, improving the model’s knowledge discovery and decision-making processes. Building on these advances, we refined the traditional approach to text encoding by introducing a hierarchical, prompted text encoder, shown in Figure 4. Our aim is to improve how textual information is aligned or correlated with visual data, a necessity for vision-language models that must interpret both text and visual inputs.

Figure 4. A hierarchical prompted text encoder learns from multi-level prompts (low-level, high-level, and global-level), with a relationship-guided attention module for modeling structural knowledge.

Looking ahead

By incorporating structured knowledge into our model training frameworks, our research lays the groundwork for more sophisticated applications. One example is enhanced image captioning, where visual language models gain the ability to describe the contents of photographs, illustrations, or any visual media with greater accuracy and depth. This improvement could significantly benefit various applications, such as assisting visually impaired users. Additionally, we envision advances in text-to-image generation, enabling visual language models to produce visual representations that are more precise, detailed, and contextually relevant based on textual descriptions.

Looking forward, we hope our research ignites a broader interest in exploring the role of structured knowledge in improving prompt tuning for both visual and language comprehension. This exploration is expected to extend the use of these models beyond basic classification tasks—where models categorize or label data—towards enabling more nuanced and accurate interactions between people and AI systems. By doing so, we pave the way for AI systems to more effectively interpret the complexities of human language.

Acknowledgements

Thank you to Yubin Wang for his contributions in implementing the algorithm and executing the experiments.

The post Structured knowledge from LLMs improves prompt learning for visual language models appeared first on Microsoft Research.

]]>