This is a submission for the Google AI Studio Multimodal Challenge
What I Built
The Problem It Solves
Many people face the daily challenge of looking at a collection of ingredients in their fridge or pantry and feeling uninspired or unsure of what to make. This often leads to food waste or repetitive meals. Traditional recipe searches require users to manually type in ingredients, which can be tedious and may not capture everything available.
The Experience It Creates
The Visual Recipe Assistant creates a seamless and intuitive experience to combat this problem:
Effortless Inspiration: Instead of typing, you simply snap a photo of your ingredients. The app takes this visual input and instantly provides you with complete, ready-to-make recipes. This removes the mental friction of meal planning and makes cooking more spontaneous and fun.
AI-Powered Culinary Creativity: The applet showcases the power of the Gemini API’s multimodal understanding. It intelligently identifies various food items from an image and generates creative, relevant recipes complete with instructions, serving sizes, and even estimated nutritional information.
Reduces Food Waste: By suggesting recipes based on what you actually have, the app encourages you to use up ingredients before they spoil, promoting a more sustainable kitchen.
A Personalized Digital Cookbook: With the ability to save your favorite generated recipes, the app becomes a personal, ever-growing cookbook. The “Saved Recipes” feature ensures that you can easily revisit meals you enjoyed, building a curated collection tailored to your tastes and pantry staples.
In essence, the Visual Recipe Assistant transforms your phone’s camera into a smart culinary partner, making meal discovery effortless, reducing food waste, and empowering you to be more creative in the kitchen.
Demo
How I Used Google AI Studio
This application is built around the multimodal capabilities of the Google Gemini API, the same models you can prototype with in Google AI Studio. Here’s a breakdown of how it was implemented:
- Core Multimodal Capability: Fusing Image and Text Input
The central feature of this app is its ability to understand and reason from two different types of input simultaneously: an image and a text prompt. This is a core strength of the Gemini models.
Image Input (ImagePart): The user provides a photograph of their ingredients. This is the visual context. The gemini-2.5-flash model doesn’t just see pixels; it performs sophisticated object recognition to identify the items as “tomatoes,” “onions,” “pasta,” “herbs,” etc. This is the “what do I have?” part of the equation.
Text Input (TextPart): The image alone isn’t enough. I pair the visual data with a carefully crafted text prompt:
“Based on the ingredients in this image, suggest up to 3 simple recipes. For each recipe, provide the recipe name, a list of ingredients with quantities, step-by-step instructions, the serving size, and estimated nutritional information (calories, protein, carbohydrates, and fats).”
This prompt gives the model its instructions—the “what should I do with this information?” part. It directs the model to act as a creative chef and to structure its response in a very specific way.
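To make this concrete, here is a minimal sketch of how the ingredient photo and the prompt can be sent together in a single request. It assumes a TypeScript app using the @google/genai SDK with gemini-2.5-flash; the SDK wiring, function names, and environment variable are illustrative assumptions, not the applet’s exact code.

```typescript
import { GoogleGenAI } from "@google/genai";

// Assumption: the API key is provided via an environment variable.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const RECIPE_PROMPT =
  "Based on the ingredients in this image, suggest up to 3 simple recipes. " +
  "For each recipe, provide the recipe name, a list of ingredients with quantities, " +
  "step-by-step instructions, the serving size, and estimated nutritional information " +
  "(calories, protein, carbohydrates, and fats).";

// Sends the photo (as base64) and the instructions as two parts of one request.
async function suggestRecipes(base64Image: string, mimeType: string): Promise<string | undefined> {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: [
      // Visual context: what the user actually has on hand.
      { inlineData: { mimeType, data: base64Image } },
      // Instructional context: what the model should do with what it sees.
      { text: RECIPE_PROMPT },
    ],
  });
  return response.text;
}
```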
The synergy of these two modalities allows the model to perform a complex task: it looks at the image, identifies the ingredients, and then uses that list as the basis for a creative text-generation task defined by the prompt.
- Leveraging an Advanced AI Studio Feature: Structured Output (JSON Schema)
A major challenge when working with large language models is getting consistently formatted output that can be easily used in an application. Getting back a plain block of text would require fragile and error-prone string parsing.
To solve this, I leveraged one of the most powerful features available through the Gemini API, which you can also configure in AI Studio: Structured Output.
responseMimeType: ‘application/json’: This tells the model that I expect the final output to be a valid JSON string.
responseSchema: This is the most critical part. I provide the model with a detailed JSON schema that defines the exact structure of the data I want. I specified that the output should be an ARRAY of OBJECTs, where each object must contain:
recipeName (a STRING)
ingredients (an ARRAY of STRINGs)
instructions (an ARRAY of STRINGs)
servingSize (a STRING)
nutritionalInfo (an OBJECT with specific string properties for calories, protein, etc.)
By defining this schema, I force the model to organize its creative output into a predictable, machine-readable format. This eliminates the need for manual parsing and makes the integration between the AI response and the user interface seamless and robust. The application can directly take the JSON response, parse it, and render the recipe cards.
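As an illustrative sketch (assuming the @google/genai TypeScript SDK and its Type helpers; the applet’s real schema may differ in detail), the schema and generation config described above look roughly like this:

```typescript
import { Type } from "@google/genai";

// An ARRAY of recipe OBJECTs, mirroring the fields listed above.
const recipeSchema = {
  type: Type.ARRAY,
  items: {
    type: Type.OBJECT,
    properties: {
      recipeName: { type: Type.STRING },
      ingredients: { type: Type.ARRAY, items: { type: Type.STRING } },
      instructions: { type: Type.ARRAY, items: { type: Type.STRING } },
      servingSize: { type: Type.STRING },
      nutritionalInfo: {
        type: Type.OBJECT,
        properties: {
          calories: { type: Type.STRING },
          protein: { type: Type.STRING },
          carbohydrates: { type: Type.STRING },
          fats: { type: Type.STRING },
        },
      },
    },
    required: ["recipeName", "ingredients", "instructions", "servingSize", "nutritionalInfo"],
  },
};

// Passed as the config alongside the image and prompt in the generateContent call.
const generationConfig = {
  responseMimeType: "application/json",
  responseSchema: recipeSchema,
};
```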
In summary, this applet uses multimodal input (image + text) to understand a user’s real-world context and leverages structured output (JSON schema) to transform the AI’s creative response into reliable data that powers a dynamic and user-friendly experience.
Multimodal Features
The specific multimodal functionality I built is the core of this application: it fuses visual input (an image of ingredients) with a detailed text prompt to generate structured JSON data (recipes). This is a powerful combination that significantly enhances the user experience in several ways.
The Multimodal Functionality Breakdown:
Visual Understanding (Image Input): The user provides a photo of their available ingredients. The gemini-2.5-flash model leverages its sophisticated computer vision capabilities to identify the individual food items in the image. It doesn’t just see a picture; it understands “these are tomatoes, that’s an onion, I see a box of pasta.” This acts as the factual, real-world context for the request.
Instructional Context (Text Input): The image alone is just data. The user’s intent is provided through a carefully crafted text prompt that is sent simultaneously with the image. The prompt instructs the model to act as a recipe generator, specifying the desired output: “suggest up to 3 simple recipes… provide the recipe name, a list of ingredients with quantities, step-by-step instructions, serving size, and estimated nutritional information.”
Structured Output (JSON): A key part of the implementation is constraining the model’s response to a specific format: structured application/json. By providing a responseSchema, the AI’s creative text and numerical data are organized into a clean, predictable format that the application can immediately parse and render into UI components.
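A brief sketch of the receiving end: because the response is schema-constrained JSON, it can be parsed straight into application-level types and rendered as recipe cards. The interface and function names below are hypothetical, for illustration only.

```typescript
// Illustrative types mirroring the response schema.
interface NutritionalInfo {
  calories: string;
  protein: string;
  carbohydrates: string;
  fats: string;
}

interface Recipe {
  recipeName: string;
  ingredients: string[];
  instructions: string[];
  servingSize: string;
  nutritionalInfo: NutritionalInfo;
}

// The schema guarantees a JSON array of recipe objects, so no string scraping is needed.
function parseRecipes(responseText: string): Recipe[] {
  return JSON.parse(responseText) as Recipe[];
}
```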
Why It Enhances the User Experience:
Intuitive and Effortless Interaction: The primary benefit is a massive reduction in friction. Instead of the tedious task of manually typing out a list of ingredients, the user performs a simple, natural action: taking a photo. This mimics asking a friend, “What can I make with this?” It’s faster, more engaging, and feels almost magical.
Solves a Practical, Real-World Problem: This functionality directly addresses the common “what’s for dinner?” dilemma. By starting with the user’s actual inventory, the generated recipes are immediately actionable and relevant. This helps reduce food waste and encourages creativity with ingredients that might otherwise be overlooked.
Creates a Reliable and Polished UI: By combining the multimodal input with a strict JSON output schema, the application avoids the pitfalls of parsing messy, unstructured text. This ensures that the generated recipes are always displayed in a clean, consistent, and easy-to-read format. The UI is robust and professional because the AI’s output is tailored to its specific needs, which is a superior user experience compared to displaying a raw block of text.
In essence, this multimodal approach transforms the user’s phone camera from a simple image-capture device into a powerful culinary assistant, turning a snapshot of their kitchen counter into a personalized, actionable meal plan.